
International Workshop New Challenges for Statistical Software - The Use of R in Official Statistics (2014, 27th of March)


SUMAR / CONTENTS 2/2014

THE WORLD OF STATISTICS 5

R – A GLOBAL SENSATION IN DATA SCIENCE Nicoleta Caragea Antoniade-Ciprian Alexandru Ecological University of Bucharest - Faculty of Economics Ana Maria Dobre National Institute of Statistics, Romania 7

STATISTICAL DATA ANALYSIS VIA R AND PHP: A CASE STUDY OF THE RELATIONSHIP BETWEEN GDP AND FOREIGN DIRECT INVESTMENTS FOR THE REPUBLIC OF MOLDOVA

PhD Candidate Ştefan Cristian CIUCU The Bucharest University of Economic Studies, Romania 17

CREATING STATISTICAL REPORTS IN THE PAST, PRESENT AND FUTURE

PhD candidate Gergely Daróczi BCE, Hungary 31

THE PROGRESS OF R IN ROMANIAN OFFICIAL STATISTICS Ana Maria Dobre National Institute of Statistics, Romania Cecilia Roxana Adam National Institute of Economic Research "Costin C. Kiritescu" of the Romanian Academy, Romania 45

MULTILEVEL MODEL ANALYSIS USING R Nicolae-Marius JULA Nicolae Titulescu University of Bucharest 55

DEMOGRAPHIC RESEARCH ON THE SOCIO ECONOMIC BACKGROUND OF STUDENTS OF THE ECOLOGICAL UNIVERSITY OF BUCHAREST

Ph.D. Janina Mihaela Mihăilă Ecological University of Bucharest 67


INTEGRATING R AND HADOOP FOR BIG DATA ANALYSIS Bogdan Oancea “Nicolae Titulescu” University of Bucharest Raluca Mariana Dragoescu The Bucharest University of Economic Studies 83

METHODOLOGICAL CONSIDERATIONS ON THE SIZE OF COEFFICIENT OF INTENSITY OF STRUCTURAL CHANGES (CISC)

Dr. Florin Marius Pavelescu Institute of National Economy 95

USING R TO GET VALUE OUT OF PUBLIC DATA PhD Candidate Marius Radu PhD Assistant Ioana Mureşan PhD Professor Răzvan Nistor Babeş-Bolyai University, Faculty of Economics and Business Administration 109

DATA EDITING AND IMPUTATION IN BUSINESS SURVEYS USING “R” Elena Romascanu National Institute of Statistics, Romania 129

THE BAYESIAN MODELLING OF INFLATION RATE IN ROMANIA PhD Senior Researcher Mihaela Simionescu (Bratu) Institute for Economic Forecasting of the Romanian Academy 147

ESTIMATION PROCEDURE IN MONTHLY RETAIL TRADE SURVEY IN SERBIA USING R SOFTWARE Sofija Suvocarev Statistical Office of the Republic of Serbia 161

DEVELOPMENT AND CURRENT PRACTICE IN USING R AT STATISTICS AUSTRIA

Matthias Templ Statistics Austria, Vienna University of Technology Alexander Kowarik Bernhard Meindl Statistics Austria 173

USING R AS AN ALTERNATIVE TEACHING TOOL IN THE ECOLOGICAL UNIVERSITY OF BUCHAREST

Carmen Ungureanu Ecological University of Bucharest 185


The World of Statistics

The International Workshop New Challenges for Statistical Software - The Use of R in Official Statistics was the second in a series of events dedicated to the use of the R Project in Romania and initiated by the R-omanian R team. We are pleased to announce that the 4th of April is the anniversary of the R-omanian R team as an R User Group. One year ago, on the 4th of April, the first workshop dedicated to the use of R took place: the Workshop State-of-the-art statistical software commonly used in applied economics. The event International Workshop New Challenges for Statistical Software - The Use of R in Official Statistics (2014, 27th of March) was a real success, with participants from academia, official statistics and business, from the following countries: UK, Austria, Netherlands, Hungary, Serbia and, of course, Romania. The quality of the works presented, the intense research activity carried out by the authors, and the results obtained are all strong points of reference for concluding that Romania is on the map of useRs. The presentations from the workshop are available on our website:

http://www.r-project.ro/workshop2014/presentations.html

The workshop was an opportunity to develop new ideas and cooperation in the field of official statistics and academia. The event highlighted once again the significant role of the National Institute of Statistics in official statistics. We were very pleased to have as our guests some pioneers of R from other statistical offices from the UK, Austria, the Netherlands, Hungary and Serbia. The R-omanian R team, together with the organizers, wants to thank all the participants for sharing their experience and knowledge.


R – a Global Sensation in Data Science
Nicoleta CARAGEA ([email protected])
Antoniade-Ciprian ALEXANDRU ([email protected])
Ecological University of Bucharest - Faculty of Economics
Ana Maria DOBRE ([email protected])
National Institute of Statistics, Romania

ABSTRACT

The main objective of this paper is to present the evolution of R as the most widely used data analysis tool among statisticians and academic researchers. Its flexibility and power have won over statisticians and data scientists alike. The paper examines some of the reasons behind the popularity of R, using tools such as a SWOT analysis. The R software environment offers integrated tools for a very large area of data analysis, from computations and data mining to high-impact visualization. As an example, the paper includes an illustration of 3D plotting.
Keywords: R Software, R Packages, Statistics, Data Visualization, 3D Plotting
JEL Classification: C13, C18, C88

Introduction

Motto: "It is easy to lie with statistics. It is hard to tell the truth without statistics" (Andrejs Dunkels)

Nowadays, R is the most used and appreciated tool in data science. The R system implements a dialect of the influential S language but has its own GUIs and IDEs. The fact that most private and public institutions, especially in Romania but also in other countries, use commercial statistical tools with predictable costs has not hindered the spread and growth of R. In this context, R has become central to a huge industry of data science, data mining and open data, both in the private and public sectors and in many fields such as statistics, medicine, biology, geographic information systems, social media, marketing, finance, engineering and so on.


This paper extends previous research by the authors (Caragea et al., 2012).

LITERATURE REVIEW

R first appeared in 1996, when the statistics professors Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand released the code as a free software package under the GNU General Public License. R is considered the lingua franca of statisticians and, more recently, of data scientists. David Smith (2011), the Chief Community Officer of Revolution Analytics, considers that data science is a valuable rebranding of computer science and applied statistics skills. In fact, the terminology of data science is related to statistics, data mining, exploratory data analysis, big data, artificial neural networks, forecasting and decision trees. Nowadays, many companies hire "data scientists" and many conferences are held under the aegis of "data science".

SWOT ANALYSIS OF R PROJECT

A comprehensive and well-documented SWOT analysis of the R software is necessary to understand its advantages and disadvantages and to explain R's rapid growth among data analysis tools, according to the illustration below.


Figure 1. SWOT Analysis of R

Strengths:
• Open-source program
• A fantastic user community that keeps growing

Weaknesses:
• R keeps all the data in RAM, so it can consume the available memory very quickly

Opportunities:
• R is the product of an international joint effort of top computational statisticians and computer language designers

Threats:
• It is considered by many to be harder to learn than other similar software, because it has more types of data structures than just the data set

A detailed SWOT analysis is presented in the next section of the paper.

STRENGTHS
• Open-source program
• R and its GUIs and IDEs are completely free; the costs of using R are related only to the training of users
• R is cross-platform: it runs on Windows, Linux and Mac OS X
• A fantastic user community that keeps growing
• User support through a very active mailing list, blogs and dedicated forums
• Being a challenge for every user to get involved and to exchange knowledge
• Continuous development and releases at the academic level, with a growing list of print books and e-books
• Linked with the way statisticians think and work (e.g. keeping track of missing values)
• Meets the changing needs of a shifting global economy because of its flexibility
• Competitive tools for Geographic Information Systems
• Operations Research tools, which are not available in SPSS
• R supports connections with the main commercial software, such as JMP, MATLAB, Spotfire, SPSS, STATISTICA, Platform Symphony and SAS

• The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses
• The flexibility to mix and match models, scripts and packages for the best results
• R functions can nest inside one another, creating nearly infinite combinations
• Easy to create scripts with all the steps of an analysis, and to run the script from the command line or menus
• R is an object-oriented language and has the advantage of operating on an object according to methods that make sense; the methods can also adapt to the type of object
• Intermediate results can be reviewed, and scripts can be edited and run as batch processes
• R stimulates critical thinking about problem-solving rather than a "push the button" mentality
• Every computational step is recorded in the background, and this history can be saved for later use or documentation
• The possibility to transform R code into HTML so that it can be published on the web (via RPubs)
• Analyses can be turned into interactive web applications that anyone can use (via Shiny), without requiring HTML or JavaScript knowledge (a minimal sketch is given after this list)
• R allows importing data from Microsoft Excel, Microsoft Access, txt, SAS, SPSS, Visual FoxPro, Oracle, MySQL, and many more formats
• R can handle a few million records on a regular PC, and there are some great packages that support handling data larger than the available RAM
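The claim above about interactive web applications can be illustrated with a very small Shiny application. The sketch below is not taken from the paper; the input names and the plotted data are arbitrary, and it only assumes that the shiny package is installed.

library(shiny)

# user interface: one slider and one plot
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# server logic: redraw the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n), main = "A random sample"))
}

# launch the application in the browser
shinyApp(ui, server)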

WEAKNESSES
• Data collection should be done with other tools; MySQL or PostgreSQL are popular among useRs for this purpose
• R keeps all the data in RAM, so it can consume the available memory very quickly
• Direct Marketing modules are not available
• Guided Analytics is not available
• The help files and the vignettes for packages are written for relatively advanced users; the documentation is sometimes impenetrable to non-statisticians
• R is not very user friendly and it requires basic knowledge of a programming language; this may limit R's long-term growth, because GUI users far outnumber programmers
• The default GUI of R is limited to simple interaction and does not include statistical procedures; the user must type commands for importing data, computing statistics and plotting graphs

OPPORTUNITIES
• R is the product of an international joint effort of top computational statisticians and computer language designers
• Users contribute to the program's ongoing development; anyone is welcome to provide bug fixes, code enhancements and new packages
• Sharing new techniques with other R users around the world via the online community
• Re-using and reproducing newly discovered techniques in the analytic operations that the user is going to perform
• A very large area of use: statistics, business analytics, finance, journalism, mapping, forecasting, social networking, spatial analysis, engineering, science, drug development, computational biology, and many more
• Easy to export results to the usual formats and to obtain data visualizations such as maps, 3D surfaces, image plots, scatter plots, histograms, bar plots, pie charts, multi-panel charts and many more
• IT skills in R are highly appreciated on the labour market
• R for mobile devices: R has a version for iOS ("R Programming Language" on iTunes), as well as a server-based implementation of RStudio for Android ("R Instructor" on Google Play)
• R supports big data and performs big data analysis
• R supports multicore task distribution and parallel computing
• R offers many facilities for learning basic statistics
• CRAN Task Views are currently available on various topics, with the possibility to extend them with new ones

THREATS
• It is considered by many to be harder to learn than other similar software, because it has more types of data structures than just the data set
• It is necessary for the user to learn the macro language of R and to control the management of the output; SPSS and SAS allow the user to skip these issues until they are needed


INTEREST IN USING R

The interest in using R can be quantified with various tools, such as Google Trends. Bob Muenchen is the author of r4stats.com, a blog which analyses the popularity of data analysis software. The evolution of R as the most used data analysis tool in the last decade is highlighted below by an analysis of R packages, given the fact that every package is a user contribution to the R system. The packages can be found mainly on the R-project website: http://cran.r-project.org/web/packages/available_packages_by_name.html. Figure 2 shows that the growth in R packages is following a rapid parabolic arc, specifically a quadratic fit with R-squared = .998 (Muenchen). The trend is all the more spectacular given that it covers only CRAN packages, not the packages from the seven other R repositories, such as Bioconductor.

Figure 2. Available Packages on CRAN

Source: http://r4stats.com/articles/popularity/
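The quadratic fit mentioned above can be reproduced in principle with a few lines of R. The sketch below uses made-up, purely illustrative package counts rather than Muenchen's actual data, and simply shows how such a trend could be fitted:

# illustrative (not actual) CRAN package counts by year
pkgs <- data.frame(
  year  = 2002:2014,
  count = c(200, 300, 450, 650, 900, 1300, 1700, 2100, 2700, 3400, 4200, 5000, 5800)
)

# quadratic (parabolic) trend, as described by Muenchen
fit <- lm(count ~ poly(year, 2), data = pkgs)
summary(fit)$r.squared   # goodness of fit of the quadratic model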

As proof of the spread of R, we analyzed on Google Trends the popularity of Google searches for the most used statistical software: R, Stata, SAS, SPSS and EViews.


Figure 3. Interest over time of R, Stata, SAS, SPSS and EViews searches (2005-2015)

Source: Google Trends, http://www.google.com/trends/explore?hl=en-US#q=%2Fm%2F0212jm%2C%20%2Fm%2F05ymvd%2C%20%2Fm%2F01bp2d%2C%20SPSS%2C%20EViews&cmpt=q

As seen in Figure 3, starting from 2005, R had a tremendous increase compared to the other analyzed Google searches. Point A on the graph is the starting point for forecasting. The forecast shows that R will keep this trend until April 2015 (the forecast horizon). Google Trends also shows that the countries searching most for R in the period January-April 2014 are the following: Iceland, India, Senegal and South Korea. In what follows we present three examples of R's 3D graphical capabilities, using the same Google Trends data for the statistical software presented above. The first example is a 3D scatterplot created with the scatterplot3d package (Ligges, 2003).

> library(scatterplot3d)
> s3d <- scatterplot3d(R, SAS.Institute, spss, pch=20, highlight.3d=TRUE, type="p")
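The three plotting calls in this section assume that the Google Trends series are already available as numeric vectors (R, SAS.Institute, spss, Stata, eviews) in the workspace. A minimal sketch of how such vectors might be obtained from an exported Google Trends CSV file is given below; the file name and column names are assumptions, not taken from the paper.

# hypothetical export of the Google Trends comparison to a CSV file
trends <- read.csv("google_trends.csv", stringsAsFactors = FALSE)

# one numeric vector per search term, named as used in the plotting calls
R             <- trends$R
SAS.Institute <- trends$SAS.Institute
spss          <- trends$SPSS
Stata         <- trends$Stata
eviews        <- trends$EViews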


Figure 4. Interest over time of R, SAS and SPSS (2005-2014) according to Google Trends

The second example is a 3D plot created with the rgl package (Adler et al., 2014).

> library(rgl)
> plot3d(R, SAS.Institute, Stata, col=rainbow(1000))

Figure 5. Interest over time of R, SAS and Stata (2005-2014) according to Google Trends

The third example is an interactive 3D plot created with the car package (Fox, 2011), which offers the option to identify points by mouse clicking. This type of plot also supports regression models.

> library(car)
> scatter3d(x=R, y=eviews, z=spss, size = 10)


Figure 6. Interest over time of R, EViews and SPSS (2005-2014) according to Google Trends

Overall, the figures presented above show an ascending trend of the interest in R compared with other statistical software.

CONCLUSIONS

R allows users and experts in specific fields of statistical computing and academic researchers to add new capabilities to the software. It is not only about writing new programs in R; it is also convenient to combine related sets of programs, data and documentation in R packages. Moreover, R is a full-fledged programming language, with a rich complement of mathematical functions, matrix operations and control structures. In Romania, a small group in official statistics is involved in small area estimation based on R. This year has seen a definite increase


in the R-omanian team's activity, extending the use of R within the research institutes of the Romanian Academy, as well as in universities and in the business field.

Acknowledgement
The authors are members of the R-omanian R Team (www.r-project.ro) and they express their special gratitude to the other members and to everyone who has helped this project grow.

References
1. Adler, D., Murdoch, D. and others (2014), rgl: 3D visualization device system (OpenGL), R package version 0.93.996, http://CRAN.R-project.org/package=rgl
2. Caragea, N., Alexandru, A.C., Dobre, A.M. (2012), "Bringing New Opportunities to Develop Statistical Software and Data Analysis Tools in Romania", The Proceedings of the VIth International Conference on Globalization and Higher Education in Economics and Business Administration, ISBN: 978-973-703-766-4, pp. 450-456
3. Fox, J., Weisberg, S. (2011), An {R} Companion to Applied Regression, Second Edition, Thousand Oaks, CA: Sage, URL: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion
4. iTunes Store, https://itunes.apple.com/gb/app/r-programming-language/id540809637?mt=8, accessed on 8th of April 2014
5. Ligges, U. and Mächler, M. (2003), "Scatterplot3d - an R Package for Visualizing Multivariate Data", Journal of Statistical Software, 8(11), 1-20
6. Google Play, https://play.google.com/store/apps/details?id=appinventor.ai_RInstructor.R2, accessed on 8th of April 2014
7. Google Trends Data, http://www.google.com/trends/explore?hl=en-US#q=%2Fm%2F0212jm%2C%20%2Fm%2F05ymvd%2C%20%2Fm%2F01bp2d%2C%20SPSS%2C%20EViews&cmpt=q, accessed on 7th of April 2014
8. Muenchen, B., The Popularity of Data Analysis Software, http://r4stats.com/articles/popularity/, accessed on 7th of April 2014
9. R Core Team (2014), R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, URL: http://www.R-project.org/
10. Smith, D. (2011), Data Science: a literature review, available at: http://blog.revolutionanalytics.com/2011/09/data-science-a-literature-review.html, accessed on 8th of April 2014

Trademarks
1. RStudio, Revolution Analytics, SAS Institute, IBM SPSS Statistics, Stata and EViews are registered trademarks of their respective companies.


Statistical Data Analysis via R and PHP: A Case Study of the Relationship between GDP and Foreign Direct Investments for the Republic of Moldova
PhD Candidate Ştefan Cristian CIUCU ([email protected])
The Bucharest University of Economic Studies, Romania

ABSTRACT

This paper provides an overview of a way of integrating R with the PHP scripting language in order to analyze statistical data (time series). We analyze the relationship between foreign direct investments and the GDP of the Republic of Moldova over the 1992-2012 period.
Keywords: R software; PHP; programming; GDP; FDI.
J.E.L. Classification: C88; L8.

INTRODUCTION

R is a language and environment for statistical computing and graphics. It is a GNU project (free software) which was designed by Ross Ihaka and Robert Gentleman. The R software is gaining ground in Romania as research papers appear. The study of (Caragea, 2012) can be mentioned, mostly for the fact that it underlines the importance and usability of R for statistical computations, data analysis, visualization and applications in various fields. R is also a powerful software environment that can work with databases; for example, in (Dobre, 2013) the manipulation of large databases is treated. At the moment, PHP is a popular general-purpose scripting language that is especially suited to web development1. PHP is behind most websites, from blogs to presentation websites or web-based applications.

1. http://php.net/


PHP appeared in 1995 and is influenced by the Perl, C, C++, Java and Tcl programming languages. The idea of integrating R with PHP has been around for a few years. A software package called R-PHP was developed within the Department of Statistical and Mathematical Sciences of the University of Palermo (Italy); it is an open-source project, with the code released by the authors under the General Public License (GPL), so it can be freely installed and used. A lot of information can be found in (Mineo, Pontillo, 2006). There are also many studies of the impact of FDI on GDP, for example the study of (Agrawal, 2011). Foreign direct investments are very important for countries in transition; they usually bring a great plus to the economic development of a country.

THE USE OF PHP IN STATISTICS

The PHP language has some mathematical extensions that include numerous functions. Three of these extensions are worth mentioning: "Math" (basic mathematical functions), "Statistics" (an extension containing functions for statistical computations) and "Trader" (Technical Analysis for Traders, which contains some functions for linear regressions). All three extensions contain a few dozen functions useful for statistical computations, but a large part of them are not documented at the moment (see [3] and [7]). Among the statistical functions of PHP we can mention:
- stats_absolute_deviation - returns the absolute deviation of an array of values;
- stats_cdf_f - calculates any one parameter of the F distribution given values for the others;
- stats_cdf_t - calculates any one parameter of the T distribution given values for the others;
- stats_standard_deviation - returns the standard deviation;
- stats_variance - returns the population variance;
- trader_linearreg - linear regression;
- trader_linearreg_slope - linear regression slope;
- trader_linearreg_intercept - linear regression intercept;
- trader_linearreg_angle - linear regression angle.

For example, this is the description provided on the php.net website for the linear regression function:

array trader_linearreg ( array $real [, integer $timePeriod ] )

where real is an array of real values and timePeriod is the number of periods (valid range from 2 to 100000), and the return value is an array with the calculated data, or false on failure.

An example of a PHP implementation for statistics, using stats_stat_correlation, which calculates Pearson's correlation coefficient of two arrays:

<?php
function correlation($x, $y) {
    $PPMCC = stats_stat_correlation($x, $y);
    echo "Pearson product-moment correlation coefficient is " . $PPMCC;
}
$array_x = array(5, 3, 6, 7, 4, 2, 9, 5);
$array_y = array(4, 3, 4, 8, 3, 2, 10, 5);
correlation($array_x, $array_y);
?>
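For comparison, the same Pearson correlation can be computed in base R with a single call; the sketch below reuses the two arrays from the PHP example above.

# the same two vectors as in the PHP example
x <- c(5, 3, 6, 7, 4, 2, 9, 5)
y <- c(4, 3, 4, 8, 3, 2, 10, 5)
cor(x, y)   # Pearson product-moment correlation coefficient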

The main observation is that plain PHP is not powerful enough for statistical analysis and does not have enough ready-made functions for statistics or econometrics (even if we custom-create the possibility of reading data from files or validating data). Also, writing PHP code for statistics and econometrics can be quite laborious and requires substantial programming knowledge (mostly because many of the functions are not documented).

R VIA PHP

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering and so on) and graphical techniques, and is highly extensible1. Accessing R via PHP can be done in multiple ways and, when implemented, it provides great flexibility for the user. Essentially, a web application using R, accessible from anywhere over an Internet connection, can be built. The final user only needs a web browser and a stable Internet connection.

1. http://www.r-project.org/


In this article we will take a look at an already implemented way to access R via PHP, using a software package called R-PHP. R-PHP implements two modules:
1. The first module allows the simple insertion of R code and prints its output (analyses and plots) on another page.
2. The second module performs some statistical analyses by using a GUI.
It is available for download with proper documentation and demonstrations. In our case, in order for the R-PHP software to work, it was installed on a computer with an Intel Core2Duo E8500 3.16 GHz processor, 4 GB RAM, the Fedora 13 Linux distribution and PHP version 5.3.8. The R-PHP software can also be configured to work on a virtual machine, for example Oracle VM VirtualBox. Besides the standard installation procedure, the following steps and commands are necessary:
• chmod 0777 R/pages/tmp
• chmod 0777 R-gui/pages/tmp
• in the file include/conn.php, at the command "CREATE TABLE 'dangerous' ...", delete "TYPE=MyISAM"

The chmod is a Unix shell command used to change access permissions of files and directories. The rights are given to three groups (OWNER, GROUP and OTHERS); the value 7 gives the read, write and execute permissions.


Figure 1. Home page of R-PHP

Using the first module requires knowledge of the R programming language, but it gives the user total freedom to run on the data any command supported by the R software. The second module is a user-friendly implementation and requires no programming skills. As a final user, you upload the data to the server and then you can run some statistical analyses. The menus are quite suggestive and well thought out. Using the second module, the final user can run a descriptive analysis, a linear regression, an analysis of variance, a cluster analysis, a principal component analysis, a factor analysis or a metric multidimensional scaling.


Figure 2. R-PHP module no. 1 (command line box and file upload)

Figure 3. R-PHP module no. 2 menu

DATA ANALYSIS USING R-PHP

In this section of the article, R-PHP will be used in order to analyze the relationship between GDP and foreign direct investments in the Republic of Moldova. This section has two major goals: 1. to show a basic way of using R-PHP for data analysis; 2. to interpret the data and draw some conclusions about the economic situation in the Republic of Moldova. This paper adopts country-specific time series data from 1992 to


2012 (data for a longer period of time is unavailable). The data source is The World Bank - http://data.worldbank.org1. The data used:

Table 1. GDP and FDI of the Republic of Moldova over the 1992-2012 period

Year GDP (US$) FDI (US$) Year GDP (US$) FDI (US$)

1992 2.319.243.407 17.000.000 2003 1.980.901.554 73.750.000

1993 2.371.812.924 14.000.000 2004 2.598.231.467 87.690.000

1994 1.702.314.353 11.568.000 2005 2.988.172.424 190.700.000

1995 1.752.995.314 25.910.000 2006 3.408.454.198 258.680.000

1996 1.695.130.484 23.740.000 2007 4.402.495.921 536.020.000

1997 1.930.071.445 78.740.000 2008 6.054.806.101 726.610.000

1998 1.639.497.207 75.500.000 2009 5.439.422.031 135.150.000

1999 1.170.785.048 37.890.000 2010 5.811.622.394 201.500.000

2000 1.288.420.223 127.540.000 2011 7.015.201.446 276.420.000

2001 1.480.656.884 54.540.000 2012 7.252.769.934 184.940.000

2002 1.661.818.168 84.050.000

1. Consulted on 15 February 2014.


In the above table we have:
- GDP (current US$)1 - GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy, plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars. Dollar figures for GDP are converted from domestic currencies using single-year official exchange rates. For a few countries where the official exchange rate does not reflect the rate effectively applied to actual foreign exchange transactions, an alternative conversion factor is used.
- Foreign direct investment, net inflows (BoP, current US$)2 - Foreign direct investments are the net inflows of investment to acquire a lasting management interest (10 percent or more of voting stock) in an enterprise operating in an economy other than that of the investor. It is the sum of equity capital, reinvestment of earnings, other long-term capital, and short-term capital, as shown in the balance of payments. This series shows net inflows (new investment inflows less disinvestment) in the reporting economy from foreign investors. Data are in current U.S. dollars.
The data must be prepared for the R software, so a file with the data will be created: a *.txt file, with the data organized in columns with a tab separator. R supports many file types, but in this article a *.txt file will be used.
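For illustration, the first lines of such a tab-separated file could look as follows; the column names are chosen here to match the variable names used in the code below, and the values are taken from Table 1.

Year	GDP	FDI
1992	2319243407	17000000
1993	2371812924	14000000
1994	1702314353	11568000
...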

R-PHP - first module - code insertion

In the browser, after accessing the URL of the installed application, the file is selected from the computer (in order to be uploaded to the server) and then, in the command line of the R-PHP application, the following code can be run:

date.analizate <- read.table("a.tab.txt", header=TRUE)
attach(date.analizate)
names(date.analizate)
date.analizate

After running the code, a new tab will be created in the browser with the following result (meaning that our data have been read and stored into a data frame):

1. http://data.worldbank.org/indicator/NY.GDP.MKTP.CD
2. http://data.worldbank.org/indicator/BX.KLT.DINV.CD.WD


Figure 4. R-PHP data read output

Now that our data can be read by R via PHP, we can run typical R commands for statistical analysis.


For a data summary, the following code can be run:

date.analizate <- read.table("a.tab.txt", header=TRUE)
summary(date.analizate)

In a new browser tab, the summary will be presented:

Figure 5. Data summary

It can be seen that R analyzed the three variables successfully. For each variable the min., max., median, mean and the 1st and 3rd quartiles are displayed. Next, a linear regression on the data will be run. The regression model used is:

$$GDP_t = \beta_0 + \beta_1 \cdot FDI_t + \varepsilon_t$$

where $GDP_t$ = GDP (current US$); $FDI_t$ = foreign direct investment, net inflows (BoP, current US$); $\varepsilon_t$ = random variable.


The code for the linear regression is:

date.analizate <- read.table("a.tab.txt", header=TRUE)
summary(lm(date.analizate$GDP~date.analizate$FDI))

where the lm command runs the linear regression and the summary command provides all the details needed in the output. The lm command alone only returns the coefficients, so the summary command is added.

Figure 6. Linear regression – GDP & FDI variables

The estimated regression equation can be written as $\widehat{GDP}_t = \hat{\beta}_0 + \hat{\beta}_1 \cdot FDI_t$, with the estimated coefficients given in the output in Figure 6. It can be noticed that GDP and FDI are positively correlated.


From the output, the coefficient of determination is $R^2 = 0.3989$, meaning that 39.89% of the variability of the GDP is explained by the foreign direct investments. The code for the ANOVA analysis is:

date.analizate <- read.table("a.tab.txt", header=TRUE)
anova(lm(date.analizate$GDP~date.analizate$FDI))

Figure 7. ANOVA analysis output

From the ANOVA table (Figure 7), we can establish the overall significance of the model. The p-value (significance F), shown in the output, is quite small, so the null hypothesis can be rejected. The p-value tests the null hypothesis that the data from all groups are drawn from populations with identical means1. In our case the overall p-value is small, so we can reject the hypothesis that all the populations have identical means. More complex analyses of the data can also be done, but the purpose of this research is to show some of the programming commands in R and the fact that R works via PHP.

1. http://www.graphpad.com/guides/prism/6/statistics/index.htm?f_ratio_and_anova_table_%28one-way_anova%29.htm
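The same quantities discussed above (the estimated coefficients, the coefficient of determination and the overall p-value) could also be extracted programmatically in a plain R session. The sketch below assumes the same a.tab.txt file with GDP and FDI columns.

# run the model once and pull out the reported quantities
date.analizate <- read.table("a.tab.txt", header = TRUE)
fit <- lm(GDP ~ FDI, data = date.analizate)

summary(fit)$coefficients      # estimated intercept and slope
summary(fit)$r.squared         # coefficient of determination (about 0.3989, as reported above)
anova(fit)[["Pr(>F)"]][1]      # overall p-value from the ANOVA table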

R-PHP - second module - GUI

The second module is quite interesting because no programming skills in R are required. All of the commands that were run in section 4.1 can be run


in this module using the appropriate command buttons in the upper part of the Web page. The only step that might cause some problems is the data preparation step. After uploading the file with the data, the application becomes quite self-explanatory. Depending on what data analysis the user needs, the proper commands from the menu should be chosen. For a quick preview of this module, see Figure 3.

CONCLUSIONS

As seen from the data presented in this article, the GDP of the Republic of Moldova, a country in transition, has an increasing trend over the 1992-2012 period. But an increasing GDP does not always mean a good life for everyone: the Republic of Moldova's biggest problem is that income is distributed unevenly. Foreign direct investments are very important for a country's economy because they can create jobs (reduce unemployment) and also increase productivity. In the Republic of Moldova's case, the FDI are not stable: from our data, it can be observed that there are some years with very high FDI and other years with low FDI. From the linear regression model briefly presented, it can be observed that, in the Republic of Moldova, there is a positive relationship between GDP and FDI. Using R via PHP means that a user can access a Web page and run commands on data without installing R on their computer. It also means that the user will use the hardware and software components of the server where the application is hosted, so if the application is configured on a powerful server, the processing speed of R will significantly increase. The R-PHP software tool developed by the Department of Statistical and Mathematical Sciences "Silvio Vianelli" of the University of Palermo (Italy) (with contact persons Alfredo Pontillo and Angelo Mineo) is definitely an eye-opener and a great piece of software. As further development, more commands in the user-friendly area would be suitable, and a mobile-friendly version of the application might be of interest.


References
1. Caragea, N., Alexandru, C.A., Dobre, A.-M. (2012), Bringing new opportunities to develop statistical software and data analysis tools in Romania, The Proceedings of the VIth International Conference on Globalization and Higher Education in Economics and Business Administration.
2. Dobre, A.-M., Gagiu, A. (2013), Manipulation of large databases with R, presented at the EUB-2013 International Conference, Ecological University of Bucharest, 4-5 April 2013.
3. http://www.php.net/manual/en/ref.stats.php
4. http://dssm.unipa.it/R-php/
5. Mineo, A., Pontillo, A. (2006), Using R via PHP for Teaching Purposes: R-php, Journal of Statistical Software, Volume 17, Issue 4, October 2006.
6. Agrawal, G., Khan, A. (2011), Impact of FDI on GDP: A Comparative Study of China and India, International Journal of Business and Management, Vol. 6, No. 10, October 2011.
7. http://www.php.net/manual/en/ref.trader.php


Creating statistical reports in the past, present and future
PhD candidate Gergely DARÓCZI ([email protected])
BCE, Hungary

ABSTRACT

The paper summarizes the most important milestones in the recent history of computer-aided data analysis, then suggests an alternative reporting workflow to the traditional statistical software methods, by means of an R package implementing statistical report templates with annotations in plain English.
Keywords: R, reports, reproducible research, literate programming

This paper first provides a brief historical overview of a variety of statistical and reporting tools actively used by practicing data analysts over the past 100 years (Daróczi-Tóth, 2013). That time covered lots of changes both in methodology and in tools: existing methods were improved and new ones were discovered; on the other hand, mainframes, personal computers and nowadays cloud computing became the standard source of processed data instead of statistical tables and the slide rule. This also changed the way statistics and data analysis are used by an ever-growing number of experts, laymen and industries: it is no surprise nowadays to do e.g. customer segmentation without any deep theoretical knowledge of how k-means clustering or latent-class analysis really works. Doing data analysis means something different compared to what Karl Pearson did in the past: in our days, statistical wizards and data-driven decision-making tools can help us run valid analyses on live databases, even on mobile platforms. To this end, the paper also proposes a way to create annotated, reproducible statistical templates and reports in R.

DEVELOPMENT OF COMPUTATIONAL HARDWARE RESOURCES

After the appearance of mechanical computers and the first programmable devices, the development of the first electronic computer, the Atanasoff-Berry Computer (ABC), began at the University of Iowa (1937). Sadly, the long-lasting process failed to produce the expected results, and the ABC was not able to complete all the scheduled tasks by the end of the project (1942).


But not much later (1946), the first general-purpose digital computer, the Electronic Numerical Integrator And Computer (ENIAC), was successfully built at the University of Pennsylvania for the United States Army. This was the first working programmable digital computer in human history and, although the machine weighed 30 tons, it was truly state-of-the-art technology in the 1940s: its performance was even better than that of the Mark II (developed at Harvard, funded by the United States Navy), which was not capable of storing programs internally.

The successor of the ENIAC was built under the supervision of János Neumann (1949). The Electronic Discrete Variable Automatic Computer (EDVAC) was already equipped with a central controlling unit and internal memory, which is an important milestone in the history of computer science. Particularly because Neumann (1993) had already published his results before the presentation of the EDVAC, there were already multiple devices working on the basis of the same concept, including the first computer storing programs, the Electronic Delay Storage Automatic Calculator (EDSAC), which was built at Cambridge University. At this time, a number of universities around the world were already working on similar projects, and even the first devices with business purposes had appeared.

The first commercially sold universal computer was the UNIVAC (UNIVersal Automatic Calculator), and it was actually used at the US Bureau of Statistics from 1951 (Stern, 1981), where electronic devices had already been in use before. Another historically significant tool was the punched card reader, first used for tabulating the 1890 census and developed by Herman Hollerith and the Tabulating Machine Company, which is considered the precursor of IBM (Truesdell, 1965). The punched card reader was a major success both in the United States and in Europe; however, because of a strong rise in rental prices, the US Census Bureau started to look for an alternative solution. First, they tried to cooperate with Simon North, but that did not bring any success, so after the request from James Powers and John Mauchly, the Bureau started to support the development of UNIVAC. It is well known that, after its launch (1951), the device had great success: besides the statistical bureau and many institutions of the US Army, multiple businesses also started to use these computers, including e.g. ACNielsen.

Next to the UNIVAC, another machine worth mentioning on the market at that time was IBM's "mainframe" product family. Although it was launched in the 1950s, the real success of mainframes is connected to the second generation of the products in the 1960s (Renfro, 2004). These


machines (IBM 7090/7094) were increasingly stable, due to the air cooling system that replaced the old oil cooling one, and were used e.g. by NASA in the Apollo programs. IBM announced the "System/360" model in 1964. The strongest version was capable of completing tens of thousands of tasks per second, and its relatively large memory (8 MB) brought great success for its designers. The business success was partially based on the elimination of software compatibility problems: the programs became portable and, even more, those programs still work today on any of the IBM servers from the zSeries product family.

Compared to the first vacuum-tube computers and the second-generation transistor computers mentioned earlier, the invention (1958) and mass production (1960s) of the integrated circuit was a fundamental change, and was crucial for the appearance of the third-generation devices with increased computational capacity. The fourth generation of computers was even more integrated, and the performance was sometimes a hundred times better than that of the IBM machines from the 1960s. The invention of the microprocessor, the possibility to transmit data between computers and the decreasing hardware prices all led to the appearance of personal computers. IBM released the first PCs in the early 1980s and multiple companies (like Xerox, Hewlett Packard, Apple or Commodore) followed the success story. The history of computer hardware from this point is familiar to most of us, as PCs and related technologies have become part of our everyday life.

As computers are getting smaller and smaller, we tend to use them even for the most common tasks, and their role in statistics is obviously crucial. Nowadays, we use notebooks or laptops, smart phones, PDAs and most recently tablets for everyday actions, usually with direct internet access. But these new devices have limited computational resources due to their increased mobility, which results in limited support for the most recent statistical methods. A possible solution to this technical problem is to reassign the tasks that require more computational resources to server computers, which led to the birth of cloud technology and online services. The basic concept behind this is to store the data and algorithms on a secure server, while the connecting clients can easily run queries on mobile devices without any unnecessary local load. This paper will present an example of such an infrastructure, beside e.g. Shiny (RStudio, 2014) or OpenCPU (Ooms, 2014).


DEVELOPMENT OF STATISTICAL SOFTWARE

Not only did the increased computational power change how data analysis and reporting work nowadays, but ever since the beginnings of probability theory or e.g. the creation of the least squares method, more and more robust, multivariate methods have been created over the last few hundred years of the history of demography and statistics. E.g. instead of analytical explanation and deduction, we now usually apply more practical methods based on simulations. To understand this significant change in statistical theory, this paper also provides a brief overview of the evolution of statistical software packages.

The first econometric software was developed at Cambridge (Renfro, 2004). Although the EDSAC was already able to run econometric programs in 1953, it was only used to execute basic operations until the late 1950s, so the general use of statistical software only began later. All the software created at this time was designed to accomplish a specific task, and users had a hard time migrating data from one software package to another, because they were not designed to be compatible with each other. To present this problem with a simplified example: an ANOVA program written in machine language may have required a completely different structure for its input data than a program creating e.g. cross tables, even if both were written for the same computer, in the same programming language.

The first generation of statistical software packages only appeared in the mid 1960s, and they overcame the earlier problems by providing a complex solution: the user could apply a variety of statistical methods to the same structured data (Leeuw, 2011). The more than 30-year-long career of BMD and later BMDP (BioMeDical Package) started in 1965 at the UCLA Department of Medicine. This originally free statistical program was created for calculations regarding health sciences. Later it became a proprietary product, until it was acquired by SPSS Inc. (1996), when the development of the software was discontinued.

The SPSS (Statistical Package for the Social Sciences) software, well known amongst social scientists, was released in 1968 by the University of Chicago. Its success can be measured by the fact that Wellmann (1998) noted the user manual of SPSS (Nie, 1970) as one of the most influential books. At that time, SPSS was exclusively focused on the social sciences, and it started to expand towards other fields only later, when acquired by IBM (2009). Since then, SPSS refers to "Statistical Product and Service Solutions", along with PASW (Predictive Analytics SoftWare). The program was originally created to process punched cards and to work only from the command line, but it later


established its own file structure (sav), a graphical user interface (1985), and then a Java-based, platform-independent version (2007). Nowadays, there are a variety of add-ons besides the "Base" package to help users accomplish a wide range of statistical tasks, be it the creation of questionnaires, designing samples or summarizing the results.

The SAS (Statistical Analysis Software) package is similarly widespread and well known; however, it is mainly focused on business processes. It was first released at the North Carolina State University (1968), and by now it is one of the biggest service providers in the business intelligence industry, beside MicroStrategy, IBM Cognos, Oracle Hyperion, Microsoft BI and SPSS Modeler. The foundations of SAS were laid down by a former student of NCSU, who began to develop a framing structure after the implementation of ANOVA and multivariate linear regression (1966). The popularity of the package was partly due to the fact that the developing team managed to deal efficiently with missing data. There were several important milestones in the development of SAS, like the first platform-independent release in the early 1980s, supporting a variety of mini (not mainframe) computers, and then the change-over from PL/I, FORTRAN and machine languages to the C programming language. By now, SAS also provides a server-hosted, so-called "on-demand" service.

Leeuw (2011) dates the appearance of the second generation of statistical software packages to 1985, when a graphical user interface was added to all of the software mentioned above and, apart from those, new packages also appeared on the market, focusing mainly on fine-tuning the graphical interface and the user experience. Data Desk was released in 1986 for Macintosh computers with the primary goal of helping exploratory data analysis. Its main advantage was the user-friendly and interactive interface, which allowed even non-experts to achieve spectacular results. Since 1997, it has been available for Windows as well; however, its development was eventually discontinued. Not much later (1989), JMP ("jump") was also released for Macintosh devices by one of the co-founders of SAS. The developers focused on improving the graphical user interface, which resulted in interactive plots and graphics for exploratory data analysis.

A significant part of the success of STATA (1985) is due to its community and the user activity, as STATA made it possible and easy to use and reference STATA code uploaded to the internet as user-contributed code in "ado" format. This community and user base is relatively large, and e.g. the STATA mailing list has extraordinary traffic (more than a thousand e-mails per month) compared to the software mentioned earlier or to any other


commercial statistical software. STATA is still under active development, and it has also had a graphical user interface since 2003.

The S programming language was started by John Chambers, and it was actively used as early as the late 1970s in the internal network of Bell Laboratories. The great advantage of S, compared to the earlier FORTRAN programs written for specific tasks, was that it used standard commands to do interactive data analysis and statistical modeling, and these functions and statistical methods were also easily accessible to developers. The program was later ported to UNIX (1980) from the General Comprehensive Operating System designed for mainframe computers, and the release of the program (1981) and later of the source code (1984) also guaranteed success for its successors, like R. The "New S" language was released in the late 1980s with real support for functions instead of the macros used before, new graphical devices (X11 and PostScript) became available, and the "formula notation", S3 and later S4 methods were also introduced at that time; they are still used today. Although S is still available today, some alternative implementations have become far more popular among users. For example, according to the TIOBE index measuring the popularity of different programming languages, R is amongst the top 30 most used programming languages, and the commercial version of S (S-PLUS) has also been in the top 100 multiple times.

The R language was developed based on the SCHEME language created by Gerald Jay Sussman and on the results of S (Hornik, 2012). The rewriting of SCHEME functions and features began in 1993 at the University of Auckland under the leadership of Ross Ihaka and Robert Gentleman. The fact that John Chambers, the author behind the original idea and development of S, is also in the R Development Core Team probably indicates the significance of the success of R as well. R is open-source software: free to use, distribute and modify under the GPL v2 license, and it is also supported by the Free Software Foundation as part of GNU. The core programming environment is available for multiple operating systems (Windows, Macintosh and Linux) for free and, what is more, many different types of graphical user interfaces and front-ends now help the everyday work of R users. Integrated development environments (like Eclipse/StatET, Emacs/ESS, RStudio, TextMate, Notepad++, etc.) are also available. Besides the fact that it is free, the great success of R is also due to CRAN (the Comprehensive R Archive Network), which is a central repository of user-contributed packages. By now, there are over 5000 libraries on CRAN, and they cover most of the currently available theoretical statistical methods.


Although anybody can upload new packages to CRAN, and the operators of the network only run automated tests on them, the great number of users, the active community (GitHub, StackOverflow, the [R-help] mailing list with more than 3000 messages per month, etc.) and their constant feedback guarantee the maintenance and further development of the software. The R Development Core Team also officially supports the stable operation of the base libraries; for example, R has become a standard and certified statistical tool for clinical trials (The R Foundation, 2012). Besides those presented above, there are also many other commercial software packages (MATLAB, Mathematica, Statistica, etc.) available on the market; however, as they are not significant for the subject of this paper, we will not discuss them in further detail.

LITERATE PROGRAMMING

Similarly to how the development of general statistical software and environments supplanted the use of custom computer programs created for specific tasks, the workflow of creating reports has also changed. For example, reproducibility of research results became more and more important in programming, just like in the natural sciences. First, this resulted in detailed comments in source code; later, source code started to appear in research papers, which eventually led to mixed documents including, for example, statistical commands in so-called chunks inside the paper, which are evaluated and replaced by their results while the document is compiled. We believe that this method resulted in a significant change in writing reports, as there is no further need to do data analysis, text formatting, etc. with different tools; all these tasks can be managed within a single piece of software, so that the user can concentrate on the real scientific tasks, not on the software environment. The first implementation of such a literate programming tool (Noweb) was released in 1994 (Johnson, 1997), and soon an R implementation was also published, called Sweave (Leisch, 2002). Sweave is a great tool, but it only supports the PDF file format, requiring LaTeX knowledge with a rather steep learning curve, so several similar alternatives also appeared on CRAN later. There are packages providing support for HTML, Open/LibreOffice, MS Word and other document formats; an almost complete list can be found on the dedicated CRAN Task View (Zeileis, 2005). Despite the considerable number of already existing similar tools, in 2011 two independent R developers and teams decided to build new packages to provide an alternative way of literate programming. The main motive of Yihui Xie (2012) with knitr was to replace Sweave with increased functionality (like caching) and support for multiple file formats.
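As a minimal illustration of the chunk mechanism described above (our own sketch, not taken from any package documentation), an R Markdown document processed by knitr mixes plain prose with R code that is evaluated and replaced by its results when the report is compiled:

The mean horsepower in the bundled mtcars dataset is `r mean(mtcars$hp)` HP.

```{r hp-histogram, echo=FALSE}
# this chunk is replaced by the rendered histogram in the compiled report
hist(mtcars$hp, main = "Horsepower in mtcars")
```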


This latter feature and the general ease of use generated huge interest in knitr in the R community, and it became one of the most trending R packages. Gergely Daróczi and Aleksandar Blagotić (2012) started to work on a similar alternative in 2011, a development which resulted in the pander, rapport and rapportools packages. Unlike knitr, the original vision of these packages was not to provide an alternative tool for literate programming; instead, we decided to create an environment that supports textual statistical templates in plain English. These templates can be considered similar to R functions that can be applied to any dataset in R, but the results are markdown-formatted textual reports instead of R objects. For example, the already existing ANOVA template can produce the usual tables and graphs with annotations in plain English, so that any student could understand the results of the analysis run in the background.

REPRODUCIBLE STATISTICAL TEMPLATES

From a technical point of view, all these templates are made up of two important parts. The first one (the "header") defines the meta-data of the statistical template, where the author can provide details about the goal and the output of the template, the list of required R packages and a few examples of usage, similarly to the R documentation; the latter part (the "body") contains all the plain English text to be shown to the user, along with the R expressions to be evaluated. The header also includes the optional or required inputs, which are to be provided by the user on call. One may consider these inputs as R function arguments: the user can pass data frames or vectors, numbers and strings, and also any number of similar options to the template to be processed later. The inputs may be of any R class from character, complex, factor, integer, logical, numeric or raw, and several attributes may also be defined for them. A simple example of such a header part:


<!--head
meta:
  title: Rapport demo
  description: This is POC demo on the usage of rapport templates
  packages:
  - ggplot2
  - pander
inputs:
- name: v
  label: Variable to analyse
  required: yes
  class: numeric
  length:
    min: 1.0
    max: 1.0
- name: color
  label: Color of the histogram
  standalone: yes
  value: red
  class: character
head-->

This header was written in YAML syntax, which is a human-readable format for representing hierarchical list data. Beside the title and description fields, we have listed two required R packages for this template, then two inputs, of which only one is compulsory. The first input requires a numeric vector to be passed, while the latter is an optional string which defaults to "red" if omitted. Anything written after the closing "head-->" tag belongs to the body of the template:


# A quick analysis on <%= v.name %>

The mean of <%= v.name %> is <%= mean(v) %> and the standard deviation is <%= sd(v) %>. Let us also check the frequency table:

<%= table(v) %>

## Tables are boring!

<%=
set.caption(paste('Histogram of', v.name))
hist(v, xlab = v.name, col = color, main = '')
%>

This part starts with a markdown-formatted, Atx-style header which also includes an R expression returning the name of the variable passed as "v", which was defined in the header above. Please note that there is no need to cat, print or e.g. xtable the R objects we generate, as all R code chunks are automatically passed to the pander S3 method, which turns any R object into human-readable Pandoc markdown (MacFarlane, 2012) that can be easily exported to various document formats. The rest of the body follows a similar syntax: plain text in regular English is mixed with some R expressions in code chunks. Now let us see the results of this very basic report template applied to the transmission variable (am) in the mtcars dataset bundled with R (Henderson–Velleman, 1981):
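As a small aside (our own illustration, not part of the original template), the same conversion can be tried interactively by calling the pander generic directly on an R object:

> library(pander)
> # turns the frequency table into a Pandoc markdown table, just like
> # the <%= table(v) %> chunk does inside the template body
> pander(table(mtcars$am))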

> library(rapport)
> rapport('demo.rapport', data = mtcars, v = 'am')

# A quick analysis on am

The mean of am is _0.4062_ and the standard
deviation is _0.499_. Let us also check the
frequency table:

-------
  0   1
--- ---
 19  13
-------


## Tables are boring!

![Histogram of am](plots/rapport-demo-rapport-1-1.png)

So, after loading the package, the rapport function takes a few arguments: the first one refers to the template to be used, then a data frame is passed. The v parameter stands for the first input defined in the header, and we omit the second input, which thus defaults to "red". All the R chunks were processed by rapport, the returned R objects were transformed to markdown, and the histogram was saved to disk, so it can easily be included in other document formats like MS Word (docx), PDF or HTML. To automatically create and open such documents in one go, use e.g. rapport.docx, rapport.pdf or rapport.html instead of rapport. For more details, please check the documentation.
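For instance, based on the helper functions named above, rendering the same demo report straight to an HTML document could look as follows (a sketch assuming the same template and data as before):

> rapport.html('demo.rapport', data = mtcars, v = 'am')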

REPORTING IN THE CLOUD

Although running a rapport template is as easy as seen above, this process still requires a working R and Pandoc installation on the user's computer, besides some R knowledge to load the data and run the above few commands. To this end, we have created a cloud environment hosting such report templates, which can be evaluated in any Internet browser, even on mobile devices and without any prior R or statistical knowledge. A proof-of-concept demo front-end for the above defined template can be seen in Figure 1, and it can also be accessed online at http://bit.ly/1kWpqMl. This tool provides a simple way for R developers to create similar front-ends and user interfaces, and it also lets non-experts use the reporting templates integrated into any web or mobile application – which is already trending in the data analysis sector.


Figure 1. Web application front-end to the statistical report template

References

1. Chambers, J. M. [1980]: Statistical Computing: History and Trends. The American Statistician. 34(4): 238–243.
2. Daróczi, G. [2012]: sandboxR: filtering malicious R calls. https://github.com/Rapporter/sandboxR
3. Daróczi, G. [2013]: pander: an R Pandoc Writer. CRAN.
4. Daróczi, G. – Blagotić, A. [2013]: rapport: an R Templating System. CRAN.
5. Daróczi, G. – Tóth, G. [2013]: Felhőtlen statisztika a felhőben. Hungarian Statistical Review. 91(11): 1118–1142.
6. Daróczi, G. [2014]: rapportools: Miscellaneous (stats) helper functions with sane defaults for reporting. CRAN.
7. Francis, I. [1981]: Statistical Software: A Comparative Review. Elsevier. New York.
8. Leisch, F. [2002]: Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis. In: Proceedings in Computational Statistics. Physica Verlag. Heidelberg. 575–580.
9. Henderson, H. V. – Velleman, P. F. [1981]: Building Multiple Regression Models Interactively. Biometrics. 37(2): 391–411.
10. Hornik, K. [2012]: The R FAQ. CRAN.
11. Johnson, A. L. – Johnson, B. C. [1997]: Literate Programming Using Noweb. Linux Journal. 42: 64–69.
12. Jong, V. J. de [1989]: A Specification System for Statistical Software. Centrum voor Wiskunde en Informatica. Amsterdam.
13. Leeuw, J. [2011]: Statistical Software: An Overview. In: Lovric, M. (ed.): International Encyclopedia of Statistical Science. Springer. Berlin. pp. 1470–1473.
14. MacFarlane, J. [2012]: Pandoc: A Universal Document Converter. http://johnmacfarlane.net/pandoc/
15. Nie, N. H. – Bent, D. H. – Hull, C. H. [1970]: SPSS: Statistical Package for the Social Sciences. McGraw-Hill. New York.
16. Ooms, J. [2013]: The RAppArmor Package: Enforcing Security Policies in R Using Dynamic Sandboxing on Linux. Journal of Statistical Software. 55(7): 1–34.
17. Ooms, J. [2014]: opencpu: OpenCPU framework for embedded statistical computation and reproducible research. CRAN.
18. R Development Core Team [2014]: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna. http://www.r-project.org/
19. R Foundation for Statistical Computing [2012]: R: Regulatory Compliance and Validation Issues. A Guidance Document for the Use of R in Regulated Clinical Trial Environments. http://www.r-project.org/doc/R-FDA.pdf
20. Renfro, C. G. [2004]: Computational Econometrics: Its Impact on the Development of Quantitative Economics. IOS Press. Amsterdam.
21. Renfro, C. G. [2009]: The Practice of Econometric Theory: An Examination of the Characteristics of Econometric Computation. Springer. Berlin.
22. RStudio [2014]: shiny: Web Application Framework for R. CRAN.
23. Routh, D. A. [2007]: Statistical Software Review. British Journal of Mathematical and Statistical Psychology. 60(2): 429–432.
24. Stern, N. B. [1981]: From Eniac to Univac: Appraisal of the Eckert-Mauchly Computers. Digital Press. Bedford.
25. Valero-Mora, P. M. – Ledesma, R. [2012]: Graphical User Interfaces for R. Journal of Statistical Software. 49(1): 1–8.
26. Von Neumann, J. [1993]: First Draft of a Report on the EDVAC. IEEE Annals of the History of Computing. 15(4): 27–75.
27. Wellman, B. [1998]: Doing It Ourselves: The SPSS Manual as Sociology's Most Influential Recent Book. In: Clawson, D. (ed.): Required Reading: Sociology's Most Influential Books. University of Massachusetts Press. Amherst. 71–78.
28. Xie, Y. [2012]: knitr: A General-Purpose Package for Dynamic Report Generation in R. CRAN.
29. Zeileis, A. [2005]: CRAN Task Views. R News. 5(1): 39–40.


The Progress of R in Romanian Official Statistics
Ana Maria DOBRE (e-mail: [email protected]), National Institute of Statistics, Romania
Cecilia Roxana ADAM (e-mail: [email protected]), National Institute of Economic Research "Costin C. Kiritescu" of the Romanian Academy, Romania

ABSTRACT

This paper presents an overview of the state of the art of the R statistical software in official statistics in Romania, predominantly in social statistics. Examples of data analysis and of successfully completed Small Area Estimation econometric models are given. The scientific approach also includes a summary of the applications of R in other statistical offices around the world. Other countries, like the United Kingdom or the Netherlands, are truly experienced in the use of R. We conclude with a series of proposals on future research opportunities and other potential analysis procedures of R in social statistics.
Keywords: R Software, Official Statistics, Social Statistics, Econometric Models, Small Area Estimation
JEL Classification: C13, C18, C88

INTRODUCTION

In 2011, the lack of available figures on international migration created the opportunity for a long-term but strong implementation of Small Area Estimation techniques in Romanian official statistics. The computational method chosen was the R software, because it seemed to offer all the advantages needed for developing the estimation model. The small international migration team using R grew and started to promote R both in official statistics and in academic research, including universities. This team was the core of the Romanian R User Group, founded on the 4th of April 2013. The examples of the use of R in other countries underline the need for spreading R in Romanian official statistics. A transition and implementation strategy for this new environment could be needed.


LITERATURE REVIEW

Eurostat itself may require, in the near future, the use of R in the statistical offices across the European Union. The arguments are easy to deduce: low costs, easy customization and use of packages, technical support provided by a large community of users, continuous upgrades and harmonisation of data formats at the level of the European statistical offices. An argument for why Eurostat is considering the use of R is given in the following. In 2012, Eurostat released the report "Analysis of the future research needs for Official Statistics". In this paper, Eurostat provides an analysis of the research tools needed in the statistical offices. The report mentions that "open source architectures will expand in the future, R software is an example." According to Eurostat's approach, an integrated use of commercial software and open source software is a foreseen strong tendency, e.g. data and code sharing between products and the use of R programs under SAS or SPSS. Within the European Statistical System, large commercial products like SAS and SPSS are often used in production processes, while open source software such as R is used for methods and technology development, experimentation and statistical innovation. Also, Small Area Estimation is presented as a hot topic in the report, SAE methods connected with visualisation tools being an important future research area. In this report, Eurostat presents the results of the ROS (Research needs in Official Statistics) Survey, conducted between 2010 and 2011. The questionnaire was sent by e-mail and was available online on the CROS Portal. The sample had 442 respondents from NSIs, research institutes, universities and others. The results of the survey show that at NSIs SAS and R are the most applied software, and that universities have a strong preference for R. Research institutes use different software, with the main focus on R, Stata and SPSS. A strongly documented (Todorov, 2010) implementation of the R software is the one of the United Nations Industrial Development Organization. Todorov explains that the statistical techniques available in R – e.g. linear and non-linear modeling, classical statistical tests, time-series analysis and clustering – as well as its data manipulation and reporting tools make this software an ideal integrated environment for both research and production in official statistics. The account of the state of the art of R at UNIDO continued in 2012 (Todorov, Templ, 2012) along this line and presents three important areas of data processing and data analysis, typical for the activities of a national or international statistical office: missing data and imputation methods, editing and outlier detection, and statistical disclosure control.


R-EVOLUTION IN ROMANIAN NATIONAL INSTITUTE OF STATISTICS

In this section we aim to present an overview of the state of the art of R in the official statistics of Romania, year by year. This timeline is created predominantly for social statistics. In 2011 the need appeared for a computational tool to develop a Small Area Estimation model. In the last months of 2011 a small team of statisticians from the International Migration Department started to use R for this purpose. R has been chosen since it is by far the most used open source statistical software among data scientists and academic communities. In 2012 came the first steps and results in using R for Small Area Estimation with the packages JoSAE (Breidenbach, 2011) and nlme (Pinheiro et al., 2014); a minimal sketch of the kind of unit-level model these packages build on is given after the course outline below. The package JoSAE is an implementation of the classical methodology of Rao (2003). In October 2012 the Romanian R Team (www.r-project.ro) was founded. 2013 was a full year, with plenty of activities. In April, the first Workshop on R – State-of-the-art statistical software commonly used in applied economics – was organized, held as a section within the EUB-2013 International Conference. There were presentations and free discussions about the advantages of implementing R in academia and official statistics in Romania. In May 2013, the Small Area Estimation methodology was successfully completed. The model was based on two data sources – the Labour Force Survey and the Population and Housing Census. Accurate estimates at NUTS 3 level (county level) have been obtained, outlining figures on international migration statistics. From June to December, courses on R were held under the aegis of the National Centre for Training in Statistics. About 50 employees were trained on the course "Introduction in Small Area Estimation Techniques with Applications in R". The course was conceived as an introduction to R, econometric modelling and Small Area Estimation techniques. The structure of the course was as follows:
• Introduction to R: installation, on-line community and resources, GUIs
• Overview of R data types: object-oriented programming, vectors, lists, dataframes, matrices
• Importing and exporting data from/to the following formats: txt, Excel, csv, SQL, SPSS, SAS, DBF
• Functions in R
• Graphics in R: histograms, scatterplots, box-and-whiskers plots, boxplots, scatterplot matrices, 3d plots
• Regression models: linear model, multiple linear regression, logit, probit
• Specification and choosing the independent variables in modelling
• Small Area Estimation techniques
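The sketch announced above: a minimal unit-level mixed model of the kind that nlme and JoSAE build on, with purely hypothetical data frame, variable and domain names, shown only to illustrate the structure and not the production model used for migration statistics.

library(nlme)
# Hypothetical unit-level data: outcome y, auxiliary variable x and the
# small area (domain) identifier 'county' in the data frame unit_data
sae_lme <- lme(fixed = y ~ x, random = ~ 1 | county, data = unit_data)
summary(sae_lme)
# estimated domain-specific random intercepts; EBLUP-type domain estimates
# would combine these with known domain means of the auxiliary variable x
ranef(sae_lme)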

As part of the research activities of the Romanian R Team, in August 2013 the Romanian version of the well-known book R for Beginners (Paradis, 2005) was released. In 2013, R also started to be used as the main tool for data editing and imputation in business surveys in Romanian official statistics. As a follow-up in the timeline of the progress of R in Romanian official statistics, in March 2014 the second of a series of events dedicated to the use of the R Project in Romania was organized: the International Workshop New Challenges for Statistical Software – The Use of R in Official Statistics. The workshop was an opportunity to develop new ideas and cooperation in the field of official statistics and academia. For 2014, the National Centre for Training in Statistics has already planned new courses based on the use of R:
• "R Statistical Software – Presenting Advantages of its Use for Data Analysis"
• "Introducing Statistics, the Need for Official Statistics"
• "Statistical Analysis – from Theory to Practice"
• "Concepts, Models and Techniques for Data Analysis"


R IN OTHER STATISTICAL OFFICES

In this section we present the spread of R in other statistical offices around the world, as shown in Figure 1.

Figure 1. Statistical offices using R

In the following we will detail the use of R in many of the countries on the map.

Austria is a pioneer in the use of R in academia, official statistics and the business field. Vienna is the location of the headquarters of the "R Foundation for Statistical Computing", set up by the R Development Core Team in order to provide support for the R project, to provide a reference point for individuals, institutions or commercial enterprises that want to support or interact with the R development community, and to hold and administer the copyright of the R software and documentation (R Development Core Team, 2005). At Statistics Austria, R has been used since 2004, and the experts there have even developed add-on packages for their own needs and methodologies. The CRAN Task View for Official Statistics was developed by an expert from Statistics Austria (Templ, 2014).


Istat (the National Institute for Statistics, Italy) is using R for sample design, for calibration and calculation of sampling variance, for selective editing, for record linkage, for statistical matching and for small area estimation. Istat has donated software libraries to R and started to migrate from SAS as early as 2009 (European Commission News, 2009). The Netherlands is another country with best practices in using R. Statistics Netherlands has developed packages for its methodologies. R is used in three forms, depending on its use in statistics production, statistical research, or research in methods and computation (Van Der Loo, 2012): the production installation, the analyst installation and the research installation. In the United Kingdom, at the Office for National Statistics, R has been used since 2004, mostly in producing statistics. The Economic and Social Data Service (ESDS) is using the R software for analysing large-scale government surveys (Walthery, 2012). For the United States government, there is an emerging awareness and recognition of the power of R in its Big Data Initiative. David Smith (2012), Chief Community Officer at Revolution Analytics, has highlighted the US approach to using R: harmonizing spill estimates from various sources and providing ranges of estimates to other agencies and the media; analyzing data from clinical trials; research and development of models to predict river flooding; providing a tool to track pollution. In the USA, R is used in agencies like the CIA, the Food and Drug Administration, the National Institute of Standards and Technology, the Consumer Financial Protection Bureau and the San Francisco Estuary Institute. Besides being the birthplace of R, New Zealand promotes the use of R in its official statistics. Institutions like the Ministry of Business, Innovation and Employment and the Department of Conservation use the R language for statistical analyses (Statisphere, 2013).

FUTURE RESEARCH OPPORTUNITIES ON R IN ROMANIAN OFFICIAL STATISTICS

Other possible applications of R in Romanian official statistics are presented below, according to the CRAN Task View Official Statistics and Survey Methodology (Templ, 2014). Almost all these procedures have dedicated packages; other procedures are enclosed within some packages.
• Complex survey design (a short sketch with the survey package is given at the end of this section): algorithms for drawing survey samples and calibrating the design weights; computing point and variance estimates; performing simulation studies; comparing different point and variance estimators under different survey designs; comparing different conditions regarding missing values, representative and non-representative outliers; creating complex survey designs (stratified sampling design, cluster sampling, multi-stage sampling and probability proportional to size sampling with or without replacement); selecting samples using probability proportional to size sampling and stratified simple random sampling; univariate stratification of survey populations with a generalisation of the Lavallee-Hidiroglou method (Lavallee, Hidiroglou, 1988); estimating (Horvitz-Thompson) totals, means, ratios and quantiles for domains or the whole survey sample; estimation of variance for complex designs by delete-a-group jackknife replication for totals, means, absolute and relative frequency distributions, contingency tables, ratios, quantiles and regression coefficients, even for domains; estimating certain Laeken indicators (at-risk-of-poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient) including their variance for domains and strata based on bootstrap resampling; comparing point and variance estimators in a simulation environment; incorporation of clustering, stratification, sampling weights and finite population corrections into a structural equation modelling analysis; post-stratification, generalized raking/calibration, GREG estimation, trimming of weights; calibrating either on a total number of units in the population, on marginal distributions or joint distributions of categorical variables, or on totals of quantitative variables; calibrating for nonresponse in stratified samples.
• Editing and visual inspection of microdata: converting readable linear (in)equalities into matrix form; deductive correction of simple rounding, typing and sign errors based on balanced edits; selective editing for continuously scaled data; robust location and scatter estimation and robust principal component analysis with a high breakdown point for incomplete data; visualizing missing values using suitable plot methods; profiling or exploring large statistical datasets.
• Statistical disclosure control: generation of confidential (micro)data; simulation of synthetic, confidential, close-to-reality populations for surveys based on sample data; providing confidential tabular data.
• Seasonal adjustment: decomposition of time series; graphical user interface for the X12-ARIMA seasonal adjustment software.
• Computing indices and indicators: estimating popular risk-of-poverty and inequality indicators (at-risk-of-poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient); tail modeling with Pareto distributions for semi-parametric estimation of indicators from continuous univariate data; computing various inequality measures (Gini, Theil, entropy, among others), concentration measures (Herfindahl, Rosenbluth) and poverty measures (Watts, Sen, SST and Foster); computing empirical and theoretical Lorenz curves as well as Pen's parade.
• Statistical record matching between two or more data sources: performing statistical matching between two data sources sharing a number of common variables; linking and deduplicating data sets; nearest neighbor matching, exact matching, optimal matching and full matching, amongst other matching methods.
• Small Area Estimation

Beyond the application to international migration statistics, the SAE method could be used for employment, poverty or education-level estimates in social statistics. The estimates obtained by Small Area Estimation modelling would contribute to policy efforts aimed at reducing poverty, inequality and social exclusion, helping to progress towards the goals of the Europe 2020 Strategy and to design better social and economic policies.
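As an illustration of the first item on the list above (complex survey design), the following is a hedged sketch using the survey package with hypothetical data frame and column names; it is not taken from the Task View itself.

library(survey)
# Hypothetical stratified sample 'smp' with a stratum identifier, a design
# weight 'w', a finite population correction 'fpc' and a study variable 'income'
des <- svydesign(ids = ~1, strata = ~stratum, weights = ~w,
                 fpc = ~fpc, data = smp)
# Horvitz-Thompson type estimates of the total and the mean, with standard errors
svytotal(~income, des)
svymean(~income, des)
# calibration to known population margins could follow,
# e.g. with postStratify() or calibrate()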

CONCLUSION

The R software represents a huge challenge for Romanian official statistics. The current status of using R is an example of best practice. An ideal situation would be one in which both statistical researchers and IT experts from official statistics embrace the use of R. A proposal for the future would be a strategy for implementing R and for migrating from Visual FoxPro and SAS to R. This would be possible with a strong motivation for using R, a strong training programme for employees, a wiki-like intranet for know-how sharing, matching the migration path to the working style, and providing technical support and documentation.

ACKNOWLEDGEMENT

The authors are grateful to the R-omanian R Team (www.r-project.ro) and give their special gratitude to everyone who made this project grow.


Bibliography

1. Breidenbach, J. (2011) JoSAE: Functions for unit-level small area estimators and their variances. R package version 0.2, http://CRAN.R-project.org/package=JoSAE (Accessed on 9th of April 2014)
2. Breidenbach, J., Astrup, R. (2012) Small area estimation of forest attributes in the Norwegian National Forest Inventory. European Journal of Forest Research, 131, 1255-1267
3. Caragea, N., Alexandru, A.C., Dobre, A.M. (2012) Bringing New Opportunities to Develop Statistical Software and Data Analysis Tools in Romania, The Proceedings of the VIth International Conference on Globalization and Higher Education in Economics and Business Administration, ISBN: 978-973-703-766-4
4. Dobre, A.M., Caragea, N., Alexandru, C. (2013) R versus Other Statistical Software, Ovidius University Annals, 13, 484-488
5. EUB-2013 International Conference, http://www.eub-2013.ueb.ro/sections/ (Accessed on 10th of April 2014)
6. Europe 2020 Strategy, available at: http://ec.europa.eu/europe2020/index_en.htm (Accessed on 10th of April 2014)
7. European Commission News (2009) IT: Statistics institute: moving to open source increases cooperation, https://joinup.ec.europa.eu/news/it-statistics-institute-moving-open-source-increases-cooperation
8. Eurostat (2012) Analysis of the future research needs for Official Statistics, Methodologies and Working Papers, available at: http://epp.eurostat.ec.europa.eu/cache/ITY_OFFPUB/KS-RA-12-026/EN/KS-RA-12-026-EN.PDF (Accessed on 10th of April 2014)
9. Ghosh, M., Rao, J.N.K. (1994) Small area estimation: an appraisal. Statistical Science, 9, 55-93
10. Lavallee, P., Hidiroglou, M.A. (1988) On the stratification of skewed populations. Survey Methodology, 14, 33-43
11. Paradis, E. (2005) R for Beginners, available at: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf (Accessed on 10th of April 2014)
12. Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., R Core Team (2014) nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-115, http://CRAN.R-project.org/package=nlme (Accessed on 10th of April 2014)
13. R Development Core Team (2005) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL: http://www.R-project.org
14. Rao, J.N.K. (2003) Small Area Estimation, John Wiley & Sons, Hoboken, New Jersey
15. Rao, J.N.K., Sinha, S.K. (2008) Robust Small Area Estimation under Unit Level Models, Proceedings of the Survey Research Section, American Statistical Association, 145-153
16. Smith, D. (2012) Applications of R in Government, available at: http://blog.revolutionanalytics.com/2012/06/applications-of-r-in-government.html (Accessed on 10th of April 2014)
17. Statisphere, Official Statistics System seminar 2013, http://statisphere.govt.nz/seminars-training-forums/official-statistics-seminar-series/archived-presentations/r-language.aspx (Accessed on 10th of April 2014)
18. Templ, M. (2014) CRAN Task View: Official Statistics and Survey Methodology, available at: http://cran.r-project.org/web/views/OfficialStatistics.html (Accessed on 10th of April 2014)
19. Todorov, V. (2010) R in the statistical office: The UNIDO experience. UNIDO Staff Working Paper, Vienna
20. Todorov, V., Templ, M. (2012) R in the statistical office: Part II, UNIDO Staff Working Paper, Vienna
21. Van Der Loo, M. (2012) The Introduction and Use of R Software at Statistics Netherlands, available at: http://www.amstat.org/meetings/ices/2012/papers/302187.pdf
22. Voineagu, V., Caragea, N., Pisica, S. (2013) Estimating International Migration on the Base of Small Area Techniques, Journal of Economic Computation and Economic Cybernetics Studies and Research, Bucharest, 3, http://www.ecocyb.ase.ro/nr.3.pdf/Voineagu%20Vergil.pdf (Accessed on 10th of April 2014)
23. Walthery, P. (2012), updated by Rosalynd Southern (2013), The R Guide to UK Data Service key UK Service, UK Data Service, University of Essex and University of Manchester, available at: http://ukdataservice.ac.uk/media/398726/usingr.pdf (Accessed on 10th of April 2014)


Multilevel model analysis using R
Nicolae-Marius JULA ([email protected])
Nicolae Titulescu University of Bucharest

ABSTRACT
Complex datasets cannot be analyzed using only simple regressions. Multilevel models (also known as hierarchical linear models, nested models, mixed models, random coefficient models, random-effects models, random parameter models or split-plot designs) are statistical models of parameters that vary at more than one level.1 Multilevel models can be used on data with many levels, although 2-level models are the most common. Multilevel models, or mixed effects models, can be estimated in R, and several packages are available on CRAN. In this paper we present some common methods for analyzing these models.
Keywords: Multilevel analysis, R, CRAN, package
JEL Classification: B23, C23, C33, C87

INTRODUCTION

Multilevel models are usually used in the statistical analysis of data that have a hierarchical or clustered structure. Such data can be found in various fields, like educational research (schools – classes), social studies (families – members), medical research (patients nested within hospitals) and so on. Clustered data may also appear as a result of the particular research design. For instance, in large-scale survey studies data collection is usually organized in a multistage sampling design that results in a clustered or stratified sample. Of course, this approach is not used exclusively in statistical studies, but the usual practice of these models is in this field of statistics or in a related one. There was a period of time when statisticians ignored this multilevel structure and performed the analyses by simply disaggregating all the data to the lowest level and then using the common standard analysis models. This approach is not problem-free. One of the problems is related to the sampling variance, the so-called design effect (deff), detailed by Kish2 in 1965.

1. Raudenbush, Stephen W., Bryk, Anthony S. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods (2nd ed.). Thousand Oaks, CA: Sage Publications. ISBN 0-7619-1904-X
2. Kish, L. (1965): Survey Sampling. Wiley, New York.


The design effect can be seen as the loss of effectiveness due to the use of cluster sampling instead of simple random sampling. The design effect is basically the ratio of the actual variance, under the sampling method actually used, to the variance calculated under the assumption of simple random sampling. As Turner stated: "The interpretation of a value of (the design effect) of, say, 3.0, is that the sample variance is 3 times bigger than it would be if the survey were based on the same sample size but selected randomly. An alternative interpretation is that only one-third as many sample cases would be needed to measure the given statistic if a simple random sample were used instead of the cluster sample with its (design effect) of 3.0"1

The design effect can be calculated as:
DEFF = 1 + δ (n – 1)
where DEFF is the design effect, δ is the intraclass correlation for the statistic in question and n is the average size of the cluster.
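As a quick numerical illustration (values chosen arbitrarily for this sketch), the design effect can be computed directly in R:

delta <- 0.05             # intraclass correlation
n     <- 30               # average cluster size
deff  <- 1 + delta * (n - 1)
deff                      # 2.45: the sampling variance is 2.45 times the SRS variance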

Looking at this equation, DEFF equals 1 only when either the intraclass correlation is zero (δ = 0) or the cluster size is one (n = 1). In all other situations DEFF is larger than one, which means that the standard statistical formulas will underestimate the sampling variance, so we may obtain significance tests with an inflated alpha level (type I error rate). Tests conducted by Tate and Wongbundhit2 showed that estimates of the regression coefficients in multilevel models are unbiased, but have a larger sampling variance compared with OLS estimators. Using significance tests in multilevel structure models without considering this aspect could lead to misinterpretations.

MULTILEVEL REGRESSION MODEL

Multilevel models (also known as hierarchical linear models, nested models, mixed models, random coefficient models, random-effects models, random parameter models or split-plot designs) are statistical models of parameters that vary at more than one level. The models assume hierarchical data, in which the dependent variable is measured at the lowest level and the independent (explanatory) variables are measured at all available levels.

1. Turner, A.G. (1996): Sampling Topics for Disability Surveys. United Nations Statistics Division, Technical Notes, December, http://www.undp.org/popin/demotss/tcndec96/tony.htm
2. Tate, R. & Wongbundhit, Y. (1983): Random versus Nonrandom Coefficient Models for Multilevel Analysis. Journal of Educational Statistics, 8, 103-120.


The level 1 regression can be written as:
Yt = a0 + a1 Xt + et   (1)
where Yt is the response variable, a0 the intercept, a1 the slope, Xt the explanatory variable and et the residual.

For example, let J be the number of groups, with a different number of individuals Nj in each group. On the individual (lowest) level we have the dependent variable Yij and the explanatory variable Xij, and on the group level we have the explanatory variable Zj. Thus, a separate regression equation can be written in each group:
Yij = b0j + b1j Xij + eij   (2)
The coefficients bj are modeled by explanatory variables at the group level:
b0j = g00 + g01 Zj + u0j   (3)
b1j = g10 + g11 Zj + u1j   (4)
Substitution of (3) and (4) in (2) gives:
Yij = g00 + g10 Xij + g01 Zj + g11 Zj Xij + u1j Xij + u0j + eij   (5)

In general, there will be more than one explanatory variable at the lowest level and also more than one explanatory variable at the highest level. Assume that we have P explanatory variables X at the lowest level, indexed by the subscript p (p = 1, ..., P), and Q explanatory variables Z at the highest level, indexed by the subscript q (q = 1, ..., Q).

Then, equation (5) becomes the more general equation:
Yij = g00 + gp0 Xpij + g0q Zqj + gpq Zqj Xpij + upj Xpij + u0j + eij   (6)
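Anticipating the lme4 syntax used later in the paper, the random intercept and random slope structure of equation (5) can be written as an lmer formula; the data frame d and the column names Y, X, Z and group below are hypothetical, chosen only for this sketch.

library(lme4)
# fixed effects: g00 (intercept), g10 (X), g01 (Z) and g11 (Z:X);
# random effects: u0j (intercept) and u1j (slope of X) varying by group
m <- lmer(Y ~ X * Z + (1 + X | group), data = d)
summary(m)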

Another way to define a multilevel regression, for level 2, is:
Yij = a0 + ai Xij + alphai + betaj + eij   (7)
where alphai is the specific effect at level 1 and betaj the specific effect at level 2; alphai and/or betaj can be treated as fixed or random. In equation (7), if j = t we have a panel structure.


For levels higher than 2, the second index may differ from "t", e.g. Yijk, where k denotes the product, j the firm and i the branch. The coefficient ai may be constant (ai = a1 for any i), not constant (variable), or constant for some explanatory variables and not for others.

The estimators used in multilevel analysis are Maximum Likelihood (ML) estimators, with standard errors estimated from the inverse of the information matrix. These standard errors are used in the Wald test: the statistic Z = parameter / (standard error of the parameter) is referred to the standard normal distribution to create a p-value for the null hypothesis that, in the population, the specific parameter is null.

ACCURACY OF FIXED/RANDOM PARAMETERS AND THEIR STANDARD ERRORS

For the fixed parameters, the estimates of the regression coefficients appear generally unbiased, for OLS and GLS as well as for ML estimation. OLS estimates seem to have a larger sampling error; Kreft1 estimates that they are about 90% efficient. Simulations by Van der Leeden & Busing2 and Mok3, and analytic work by Snijders & Bosker4, suggest that a large number of groups is more important than a large number of individuals per group for the precision of the results. For the random parameters, the estimates of the residual error at the lowest level are mostly accurate. The group-level variance components are generally underestimated (with FML somewhat more than with RML). The findings state that GLS variance estimates are less accurate than ML ones, and that for accurate estimates many groups (>100) may be needed5.

1. Kreft, Ita G.G. (1996): Are Multilevel Techniques Necessary? An Overview, Including Simulation Studies. California State University, Los Angeles.
2. Van Der Leeden, R. & Busing, F. (1994): First Iteration versus IGLS/RIGLS Estimates in Two-level Models: a Monte Carlo Study with ML3. Department of Psychometrics and Research Methodology, Leiden University, Leiden.
3. Mok, M. (1995): Sample Size Requirements for 2-level Designs in Educational Research. Multilevel Models Project, University of London, London.
4. Snijders, T.A.B. & Bosker, R. (1993): Modeled Variance in Two-level Models. Journal of Educational Statistics, 18, 273-259.
5. Idem 7


ACCURACY AND SAMPLE SIZE

It is generally accepted that, as sample sizes at all levels increase, the estimates and their standard errors improve. Kreft suggests a "rule of thumb", which she calls the '30/30 rule': to be statistically safe, researchers should use a sample of at least 30 groups with at least 30 individuals per group. From the various simulations presented above, this rule is most useful for the fixed parameters. Some specialists suggest that the numbers should be modified as follows: if there is strong interest in cross-level interactions, the number of groups should be larger (a 50/20 rule – 50 groups with 20 individuals per group); if there is stronger interest in the random part, or in the variance and/or covariance components, the number of groups should be considerably larger, which leads to a 100/10 rule (100 groups with 10 individuals per group). One should also take into account the costs attached to data collection, so if the number of groups is increased, the number of individuals per group might have to decrease.

MULTILEVEL ANALYSIS IN R

The widely used package in R for multilevel analysis is lme4. It is not installed by default, so one should call:

install.packages("lme4")

For this paper we use a modified example from Harvey Goldstein – Datasets used in Multilevel Statistical Models, 3rd edition, 2003 (http://www.bristol.ac.uk/cmm/team/hg/msm-3rd-ed/datasets.html), converted into csv file format for ease of use. The dataset used can be found here: http://www.bristol.ac.uk/cmm/team/hg/msm-3rd-ed/jsp-728.xls. A new column, "School_class", was inserted, generated as a random number from 1 to 4. For a complete listing, see the Annex.

The dataset used has the following columns:
• math_yr_3 – result obtained by a student in the 3rd year
• math_yr_1 – result obtained by a student in the 1st year
• Gender – gender of the student (1 for masculine)
• Social_class
• School_class – randomized class associated to schools (1 to 4)
• School_ID
• Normal_score_yr_3 – average score for students in the 3rd year
• Normal_score_yr_1 – average score for students in the 1st year

First, we used a simple regression (OLS):

lm(formula = math_yr_3 ~ math_yr_1 + Gender + Social_class + School_ID + School_class, data = my.lmm.data)
             coef.est coef.se
(Intercept)  14.54    0.98
math_yr_1     0.64    0.03
Gender       -0.33    0.36
Social_class -0.71    0.40
School_ID     0.02    0.01
School_class -0.08    0.16
---
n = 728, k = 6
residual sd = 4.82, R-Squared = 0.47

There are several approaches to multilevel analysis. We present here the use of the lmer function for two cases: varying intercept and varying slope.

1. Fit a varying intercept model with lmer
Group-level variables can be specified using the syntax (1 | School_ID), which tells lmer to fit a linear model with a varying-intercept group effect using the variable School_ID:

> MLL.Example.6 <- lmer(math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID), data = my.lmm.data)
> display(MLL.Example.6)
lmer(formula = math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID), data = my.lmm.data)
            coef.est coef.se
(Intercept) 14.10    0.73
math_yr_1    0.65    0.02
Gender      -0.35    0.34

Error terms:
 Groups    Name        Std.Dev.
 School_ID (Intercept) 1.81
 Residual              4.45
---
number of obs: 728, groups: School_ID, 48
AIC = 4309.5, DIC = 4286.9
deviance = 4293.2
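To inspect the estimated group effects from such a fit, the usual lme4 accessor functions can be used; a brief sketch, assuming the MLL.Example.6 object fitted above:

> fixef(MLL.Example.6)                  # fixed effects shared by all schools
> head(ranef(MLL.Example.6)$School_ID)  # school-specific intercept deviations
> head(coef(MLL.Example.6)$School_ID)   # combined per-school coefficients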


Multiple group effects can be fitted with multiple group effect terms:

> MLL.Example.7 <- lmer(math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID) + (1 | School_class), data = my.lmm.data)
> display(MLL.Example.7)
lmer(formula = math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID) + (1 | School_class), data = my.lmm.data)
            coef.est coef.se
(Intercept) 14.10    0.73
math_yr_1    0.65    0.02
Gender      -0.35    0.34

Error terms:
 Groups       Name        Std.Dev.
 School_ID    (Intercept) 1.81
 School_class (Intercept) 0.00
 Residual                 4.45
---
number of obs: 728, groups: School_ID, 48; School_class, 4
AIC = 4311.5, DIC = 4286.9
deviance = 4293.2

Nested group effect terms can be fitted using the following syntax:

> MLL.Example.8 <- lmer(math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID/School_class), data = my.lmm.data)
> display(MLL.Example.8)
lmer(formula = math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID/School_class), data = my.lmm.data)
            coef.est coef.se
(Intercept) 14.10    0.73
math_yr_1    0.65    0.02
Gender      -0.35    0.34

Error terms:
 Groups                 Name        Std.Dev.
 School_class:School_ID (Intercept) 0.00
 School_ID              (Intercept) 1.81
 Residual                           4.45
---
number of obs: 728, groups: School_class:School_ID, 179; School_ID, 48
AIC = 4311.5, DIC = 4286.9
deviance = 4293.2


Here (1 | School_ID/School_class) means that we want to fit a mixed effect term for varying intercepts (1 |) by school and for classes that are nested within schools.

2. Fit a varying slope model with lmer
To analyze the effect of different student-level indicators as they vary across School_class-es, as an alternative to fitting unique models by school (or by School_ID/School_class), a varying slope model can be fitted. The random effect term can be modified to include variables before the grouping terms:

(1 + Gender|School_ID/School_class)

This is interpreted by R as fitting a varying slope and varying intercept model for schools and for classes nested within schools, and as allowing the slope of the Gender variable to vary by School_ID.

> MLL.Example.9 <- lmer(math_yr_3 ~ math_yr_1 + (1 + Gender | School_ID/School_class), data = my.lmm.data)
> display(MLL.Example.9)
lmer(formula = math_yr_3 ~ math_yr_1 + (1 + Gender | School_ID/School_class), data = my.lmm.data)
            coef.est coef.se
(Intercept) 13.96    0.72
math_yr_1    0.65    0.02

Error terms:
 Groups                 Name        Std.Dev. Corr
 School_class:School_ID (Intercept) 0.24
                        Gender      0.48     -1.00
 School_ID              (Intercept) 1.76
                        Gender      0.11      1.00
 Residual                           4.44
---
number of obs: 728, groups: School_class:School_ID, 179; School_ID, 48
AIC = 4318.1, DIC = 4288.1
deviance = 4294.1
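The competing specifications fitted above can also be compared formally; a short sketch, assuming the objects from the previous sections (anova() refits lmer models with maximum likelihood before the comparison, which is most meaningful for nested models):

> anova(MLL.Example.6, MLL.Example.9)   # likelihood-ratio style comparison
> AIC(MLL.Example.6, MLL.Example.9)     # information criteria side by side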


CONCLUSIONS

This paper is just an introduction to multilevel modeling in R. The data used are partly generated, and the results obtained serve only to exemplify the functions. Using R for multilevel modeling is an easy and powerful way to obtain the needed results. The high flexibility of the accepted data formats recommends the R environment as an alternative to other commercial solutions. There are also other packages dealing with regression analysis, and the ongoing support of the community should help analysts find the best available solution for their analyses.

References

1. Bates, D., Maechler, M., Bolker, B. & Walker, S. (2014): lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-6. http://CRAN.R-project.org/package=lme4
2. Goldstein, H. (1999): Multilevel Statistical Methods, London, Institute of Education, Multilevel Models Project, April
3. Kish, L. (1965): Survey Sampling. Wiley, New York.
4. Knowles, J. E. (2013): Getting Started with Multilevel Modeling in R, http://jaredknowles.com/journal/2013/11/25/getting-started-with-mixed-effect-models-in-r
5. Kreft, Ita G.G. (1996): Are Multilevel Techniques Necessary? An Overview, Including Simulation Studies. California State University, Los Angeles.
6. Mok, M. (1995): Sample Size Requirements for 2-level Designs in Educational Research. Multilevel Models Project, University of London, London.
7. R Core Team (2014): R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/
8. Snijders, T.A.B. & Bosker, R. (1993): Modeled Variance in Two-level Models. Journal of Educational Statistics, 18, 273-259.
9. Raudenbush, S. W. & Bryk, A. S. (2002): Hierarchical Linear Models: Applications and Data Analysis Methods (2nd ed.). Thousand Oaks, CA: Sage Publications. ISBN 0-7619-1904-X
10. Tate, R. & Wongbundhit, Y. (1983): Random versus Nonrandom Coefficient Models for Multilevel Analysis. Journal of Educational Statistics, 8, 103-120.
11. Turner, A.G. (1996): Sampling Topics for Disability Surveys. United Nations Statistics Division, Technical Notes, December, http://www.undp.org/popin/demotss/tcndec96/tony.htm
12. Van Der Leeden, R. & Busing, F. (1994): First Iteration versus IGLS/RIGLS Estimates in Two-level Models: a Monte Carlo Study with ML3. Department of Psychometrics and Research Methodology, Leiden University, Leiden.


ANNEX>> my.lmm.data<-read.csv(file.choose()) >> library(lme4) # load library > library(arm) # convenience functions for regression in R >> head(my.lmm.data) math_yr_3 math_yr_1 Gender Social_class School_class School_ID Normal_score_yr_3 Normal_score_yr_1 1 39 36 1 0 4 1 1.802743 1.551093 2 11 19 0 1 1 1 -2.290740 -0.980330 3 32 31 0 1 3 1 -0.041320 0.638187 4 27 23 0 0 3 1 -0.749930 -0.459870 5 36 39 0 0 1 1 0.743105 2.149517 6 33 25 1 1 1 1 0.162541 -0.181760 >> OLSexamp <- lm(math_yr_3 ~ math_yr_1 + Gender + Social_class + School_ID + School_class, data = my.lmm.data) > display(OLSexamp) lm(formula = math_yr_3 ~ math_yr_1 + Gender + Social_class + School_ID + School_class, data = my.lmm.data) coef.est coef.se (Intercept) 14.54 0.98 math_yr_1 0.64 0.03 Gender -0.33 0.36 Social_class -0.71 0.40 School_ID 0.02 0.01 School_class -0.08 0.16 --- n = 728, k = 6 residual sd = 4.82, R-Squared = 0.47 > #summary(OLSexamp) >>> MLL.Example <- glm(math_yr_3 ~ math_yr_1, data = my.lmm.data) > display(MLL.Example) glm(formula = math_yr_3 ~ math_yr_1, data = my.lmm.data) coef.est coef.se (Intercept) 13.84 0.69 math_yr_1 0.65 0.03 --- n = 728, k = 2 residual deviance = 16945.4, null deviance = 31834.8 (difference = 14889.4) overdispersion parameter = 23.3 residual sd is sqrt(overdispersion) = 4.83

> # Fit a varying intercept model
> MLL.Example.2 <- glm(math_yr_3 ~ math_yr_1 + Gender, data = my.lmm.data)
> display(MLL.Example.2)
glm(formula = math_yr_3 ~ math_yr_1 + Gender, data = my.lmm.data)
            coef.est coef.se
(Intercept) 14.01     0.71
math_yr_1    0.65     0.03
Gender      -0.36     0.36
---
n = 728, k = 3
residual deviance = 16922.1, null deviance = 31834.8 (difference = 14912.7)
overdispersion parameter = 23.3
residual sd is sqrt(overdispersion) = 4.83
> AIC(MLL.Example.2)
[1] 4364.317
> anova(MLL.Example, MLL.Example.2, test = "F")
Analysis of Deviance Table

Model 1: math_yr_3 ~ math_yr_1
Model 2: math_yr_3 ~ math_yr_1 + Gender
  Resid. Df Resid. Dev Df Deviance      F Pr(>F)
1       726      16945
2       725      16922  1   23.321 0.9992 0.3178
> MLL.Example.3 <- glm(math_yr_3 ~ math_yr_1 + Gender + School_class, data = my.lmm.data)
> display(MLL.Example.3)
glm(formula = math_yr_3 ~ math_yr_1 + Gender + School_class, data = my.lmm.data)
             coef.est coef.se
(Intercept)  14.15     0.82
math_yr_1     0.65     0.03
Gender       -0.36     0.36
School_class -0.05     0.16
---
n = 728, k = 4
residual deviance = 16919.5, null deviance = 31834.8 (difference = 14915.4)
overdispersion parameter = 23.4
residual sd is sqrt(overdispersion) = 4.83


> AIC(MLL.Example.3)
[1] 4366.204
> anova(MLL.Example, MLL.Example.3, test = "F")
Analysis of Deviance Table

Model 1: math_yr_3 ~ math_yr_1
Model 2: math_yr_3 ~ math_yr_1 + Gender + School_class
  Resid. Df Resid. Dev Df Deviance      F Pr(>F)
1       726      16945
2       724      16920  2   25.951 0.5552 0.5742
> table(my.lmm.data$Gender, my.lmm.data$School_class)
      1   2   3   4
  0 102 100  98  87
  1  91  82  79  89
> MLL.Example.4 <- glm(math_yr_3 ~ math_yr_1 + Gender + School_ID:School_class, data = my.lmm.data)
> display(MLL.Example.4)
glm(formula = math_yr_3 ~ math_yr_1 + Gender + School_ID:School_class, data = my.lmm.data)
                       coef.est coef.se
(Intercept)            13.88     0.75
math_yr_1               0.65     0.03
Gender                 -0.36     0.36
School_ID:School_class  0.00     0.00
---
n = 728, k = 4
residual deviance = 16915.2, null deviance = 31834.8 (difference = 14919.7)
overdispersion parameter = 23.4
residual sd is sqrt(overdispersion) = 4.83
> # Fit a varying intercept model with lmer
> MLL.Example.6 <- lmer(math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID), data = my.lmm.data)
> display(MLL.Example.6)
lmer(formula = math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID), data = my.lmm.data)
            coef.est coef.se
(Intercept) 14.10     0.73
math_yr_1    0.65     0.02
Gender      -0.35     0.34

Error terms:
 Groups    Name        Std.Dev.
 School_ID (Intercept) 1.81
 Residual              4.45
---
number of obs: 728, groups: School_ID, 48
AIC = 4309.5, DIC = 4286.9
deviance = 4293.2
> # We can fit multiple group effects with multiple group effect terms.
> MLL.Example.7 <- lmer(math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID) + (1 | School_class),
+                       data = my.lmm.data)
> display(MLL.Example.7)
lmer(formula = math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID) + (1 | School_class),
    data = my.lmm.data)
            coef.est coef.se
(Intercept) 14.10     0.73
math_yr_1    0.65     0.02
Gender      -0.35     0.34

Error terms:
 Groups       Name        Std.Dev.
 School_ID    (Intercept) 1.81
 School_class (Intercept) 0.00
 Residual                 4.45
---
number of obs: 728, groups: School_ID, 48; School_class, 4
AIC = 4311.5, DIC = 4286.9
deviance = 4293.2
> # nested group effect terms
> MLL.Example.8 <- lmer(math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID/School_class),
+                       data = my.lmm.data)
> display(MLL.Example.8)
lmer(formula = math_yr_3 ~ math_yr_1 + Gender + (1 | School_ID/School_class), data = my.lmm.data)
            coef.est coef.se
(Intercept) 14.10     0.73
math_yr_1    0.65     0.02
Gender      -0.35     0.34

Error terms:
 Groups                 Name        Std.Dev.
 School_class:School_ID (Intercept) 0.00
 School_ID              (Intercept) 1.81
 Residual                           4.45
---
number of obs: 728, groups: School_class:School_ID, 179; School_ID, 48
AIC = 4311.5, DIC = 4286.9
deviance = 4293.2
> # Fit a varying slope model with lmer
> MLL.Example.9 <- lmer(math_yr_3 ~ math_yr_1 + (1 + Gender | School_ID/School_class),
+                       data = my.lmm.data)
> display(MLL.Example.9)
lmer(formula = math_yr_3 ~ math_yr_1 + (1 + Gender | School_ID/School_class), data = my.lmm.data)
            coef.est coef.se
(Intercept) 13.96     0.72
math_yr_1    0.65     0.02

Error terms:
 Groups                 Name        Std.Dev. Corr
 School_class:School_ID (Intercept) 0.24
                        Gender      0.48     -1.00
 School_ID              (Intercept) 1.76
                        Gender      0.11      1.00
 Residual                           4.44
---
number of obs: 728, groups: School_class:School_ID, 179; School_ID, 48
AIC = 4318.1, DIC = 4288.1
deviance = 4294.1


Demographic Research on the Socio-Economic Background of Students of the Ecological University of Bucharest

Ph.D. Janina Mihaela MIHĂILĂ ([email protected])
Ecological University of Bucharest

ABSTRACT

The paper describes a socio-demographic and economic research study performed on the first-year students of the Ecological University of Bucharest, focusing on understanding and investigating the conditions inside the students' families and the social environment in their home towns. This research is key to understanding the correlations between the socio-economic conditions of the family's geographical area and the actual career options and decisions of the students newly admitted to our faculties.
Key words: correlation, R programming language, demographic research, universities.
JEL Classification: I10, I20, I23, I25, J10.

INTRODUCTION

Demographic studies are highly important in modern societies. All sorts of entities, from multinational companies and retail giants to governments and city municipalities, resort to them in order to correctly assess and estimate the impact of a certain factor on a specific segment of the population. In order to take correct decisions about managing the advertising of the University, its faculties and its study programmes, we must evaluate the socio-economic premises of the geographical and demographic area of our potential students. The present paper focuses on three important aspects:
- economic
- social and demographic
- cultural


For this research, our aim was to investigate the dependencies between socio-demographic factors such as:
- the family's income
- regional development and specific opportunities
- the family's level of education.

There is a clear connection between the socio-demographic environment a student comes from and his or her later career development and future opportunities. At the beginning of the 20th century, Sigmund Freud remarked on the profound connection between development in the early years of childhood and the variety of options available in adult life. We started by choosing a representative sample: in our case we performed the study on the entire group of first-year students who joined the Ecological University of Bucharest. They may be considered a representative sample, as we have people from all the regions of Romania and from a wide variety of social backgrounds. Each student was given a specific questionnaire, without any prior notice, and asked to fill it in within one hour. It contained questions that helped us understand and accurately represent their origin from a social, cultural and regional point of view. In this way, we gained insight into the complexity of factors that had led them to join this specific university instead of others. All answers were anonymous. The first step in our project was to characterize the distribution inside the sample considered. This study has been conducted on a sample of 249 students, having the following structure:

Gender and age defined structure of the sample                                   Table 1

Age (years)    F      M      Σ      %
18 - 29        54     37     91     36.55
30 - 40        76     43     119    47.79
> 40           19     20     39     15.66
Σ              149    100    249    100


We had a very high response rate of 90.55% for these questionnaires: we distributed a total of 275 surveys to students and received back 249 of them completed with all the information required. This very high percentage was also due, to a large extent, to the fact that the students were provided with a specially designated collection box for the easy return of the questionnaires. Upon receipt, the results were processed with dedicated software and we were able to obtain a more detailed image of the correlations behind the students' choice of a specific faculty.

Sample's structure compared to the district's population number                  Table 2

District    Population    Weight    Respondents    %
D1          1628426       0.25      67             26.90
D2          301425        0.05      23             9.24
D3          316652        0.05      10             4.02
D4          674903        0.10      6              2.40
D5          680945        0.11      24             9.64
D6          353481        0.05      5              2.01
D7          654870        0.10      11             4.42
D8          540508        0.08      13             5.22
D9          211622        0.03      18             7.23
D10         374240        0.06      37             14.86
D11         393340        0.06      16             6.43
D12         339510        0.05      19             7.63
Σ           6469922       1.00      249            100


INPUT DATA

Respondents' structure by educational level and district                          Table 3

                     Family's level of education
District of birth    Middle school    High school graduation    Bachelor's/master's degree    Σ
D1                   0                23                        44                            67
D2                   2                9                         12                            23
D3                   0                3                         7                             10
D4                   0                1                         5                             6
D5                   1                7                         16                            24
D6                   0                2                         3                             5
D7                   1                5                         5                             11
D8                   0                4                         9                             13
D9                   3                8                         7                             18
D10                  3                15                        19                            37
D11                  1                10                        5                             16
D12                  2                12                        5                             19
Σ                    13               99                        137                           249
%                    5.22             39.76                     55.02                         100

Table 4. Respondents' family monthly income by district of birth

                     Family monthly income (RON)
District of birth    Under 2000    2001 - 4000    4001 - 6000    6001 - 10,000    Above 10,000
D1                   32            14             19             1                1
D2                   18            2              3              0                0
D3                   7             2              1              0                0
D4                   5             1              0              0                0
D5                   17            3              3              1                0
D6                   3             1              1              0                0
D7                   9             1              0              0                1
D8                   11            2              0              0                0
D9                   13            5              0              0                0
D10                  16            15             5              1                0
D11                  10            4              2              0                0
D12                  6             11             2              0                0
Σ                    147           61             36             3                2
%                    59.04         24.50          14.46          1.20             0.80


Remark. NUTS 3 statistical regions of the European Union. There are three levels defined in the Nomenclature of Territorial Units for Statistics (NUTS). The above category refers to regions belonging to the third level (NUTS 3, also known as NUTS III), which is largely used by Eurostat and other European Union bodies.

Table 5. Respondents' structure by demographic area and district of birth

                     Demographic area of birth
District of birth    Rural    Intermediate (suburban)    Urban    Σ
D1                   5        28                         34       67
D2                   7        1                          15       23
D3                   2        6                          2        10
D4                   1        3                          2        6
D5                   4        8                          12       24
D6                   0        0                          5        5
D7                   3        2                          6        11
D8                   4        5                          4        13
D9                   2        7                          9        18
D10                  9        9                          19       37
D11                  6        7                          3        16
D12                  8        4                          7        19
Σ                    51       80                         118      249
%                    20.48    32.13                      47.39    100.00

STATISTICAL ANALYSIS

Calculation of the correlation coefficient (Pearson). We compute the correlation coefficient (Pearson coefficient), interpreted according to Colton's empirical rules (Theodore Colton, Professor, Boston University). The correlation coefficient is an indicator independent of the units of measure of the two variables:

r = \frac{COV(X, Y)}{S_X \cdot S_Y}

where S_X and S_Y represent the standard deviations of the X and Y series, respectively.
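As a quick illustration (a minimal sketch with made-up vectors x and y, not the survey data), the coefficient can be computed in R either directly from the definition above or with the built-in cor() function:

x <- c(2, 1, 0, 3, 1)
y <- c(1, 2, 1, 3, 0)
r_manual <- cov(x, y) / (sd(x) * sd(y))    # definition: COV(X, Y) / (S_X * S_Y)
r_cor    <- cor(x, y, method = "pearson")  # built-in equivalent
c(r_manual, r_cor)                         # the two values coincide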


Indicators definition. In order to express numerically the possibility of, and the individual decision for, enrolling in a higher education institution, we define the following indicators:
1. the accessibility to higher education services indicator, ACIHE, as an expression of the level of accessibility to higher education programmes;
2. the aspiration to higher education services indicator, ASIHE, expressing the extent to which the individual is determined to enroll in a higher education study programme.

Calculation formula. For the answers to items Q7-Q14, numerical values are assigned to the different response variants in increasing order, marked with values from 0 to n. The following formulas are used (a minimal R sketch follows below):
1. the ACIHE indicator is calculated as the sum of the values given by respondents to items Q10, Q12 and Q14 in the questionnaire;
2. the ASIHE indicator is calculated as the sum of the values recorded by respondents for items Q7, Q8 and Q9 from the questionnaire.
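A minimal sketch of this calculation, assuming the coded answers are stored in a data frame whose hypothetical columns VQ7...VQ14 hold the numerical values described above (the column names and toy values are illustrative only):

answers <- data.frame(VQ7 = c(2, 1, 0), VQ8 = c(1, 2, 1), VQ9 = c(0, 1, 2),
                      VQ10 = c(1, 0, 2), VQ12 = c(2, 2, 1), VQ14 = c(0, 1, 3))
ACIHE <- answers$VQ10 + answers$VQ12 + answers$VQ14   # accessibility indicator (Q10 + Q12 + Q14)
ASIHE <- answers$VQ7 + answers$VQ8 + answers$VQ9      # aspiration indicator (Q7 + Q8 + Q9)
data.frame(ACIHE, ASIHE)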

TECHNIQUES AND STATISTICAL METHODS OF CALCULATION USED.

The R programming language was used for the input, storage and processing of the data, in order to better familiarize the students with this programming language and develop their skills during the process. In developing this application, the following commands and package functions (scripts) were used:

a. for data table creation we used:
- the read.table command; alternatively, the read.table("clipboard") command may be used if the data are saved in .xls format


# table s (student)
s <- read.table(text = "
District Middle High Bachelor Master
D1 7 54 25 5
D2 3 16 18 2
D3 1 10 12 4
D4 0 10 19 1
D5 1 1 0 0
", header = TRUE)
s

- matrix-type entities, through the cbind() or rbind() functions
- data.frame objects (a short sketch of these alternatives is given below)
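A minimal sketch of these alternatives, building a small illustrative table equivalent to the one above with data.frame() and then adding a row with rbind() (toy values only):

s2 <- data.frame(District = c("D1", "D2"), Middle = c(7, 3),
                 High = c(54, 16), Bachelor = c(25, 18), Master = c(5, 2))
s2 <- rbind(s2, data.frame(District = "D3", Middle = 1, High = 10, Bachelor = 12, Master = 4))
s2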

b. in performing the various numerical calculations on the data we used:
- storage in objects: variables, vectors, matrices, tables or data frames
- the interactive mode of the R console

c. calculating the Pearson correlation coefficient for the corresponding data sets with the statistical correlation function

d. visualizing the processed objects and the results:
- a pie chart showing the distribution of the income vector

pie(stud, main = "Student's income in District 1 (RON)", col = rainbow(length(stud)),
    labels = c("<2000", "2000-4000", "4000-6000", "6000-8000", ">8000"))
legend(1.25, 0.5, c("<2000", "2000-4000", "4000-6000", "6000-8000", ">8000"),
       cex = 0.8, fill = rainbow(length(stud)))

- histogram showing the distribution of the students vector

brk <- c(0, 3, 4, 5, 6, 10, 16)
hist(Students, col = heat.colors(length(brk)), breaks = brk, xlim = c(0, max_num),
     right = F, main = "Student Density", las = 1, cex.axis = 0.8, freq = F)

- a line chart with two axes and a legend, in which we compute the y-axis values using the max function, so that any changes in these data will be automatically reflected in the graph.


Highschool <- c(23, 9, 3, 1, 7)
Bachelor <- c(44, 12, 7, 5, 16)
g_range <- range(0, Highschool, Bachelor)
plot(Highschool, type = "o", col = "blue", ylim = g_range, axes = FALSE, ann = FALSE)
axis(1, at = 1:5, lab = c("D1", "D2", "D3", "D4", "D5"))
axis(2, las = 1, at = 10 * 0:g_range[2])
box()
lines(Bachelor, type = "o", pch = 22, lty = 2, col = "red")
title(main = "CHART3", col.main = "red", font.main = 4)
title(xlab = "Districts", col.lab = rgb(0, 0.5, 0))
title(ylab = "Number", col.lab = rgb(0, 0.5, 0))
legend(1, g_range[2], c("Highschool", "Bachelor"), cex = 0.8, col = c("blue", "red"),
       pch = 21:22, lty = 1:2)

RESULTS OBTAINED AND THEIR INTERPRETATION

Indicator calculation. The indicator calculation is performed using the R system, thus speeding up the process, with the following code:

i <- read.table(text = "
Respondent VQ6 VQ7 VQ8 VQ9
R1 7 2 8 5
R2 3 2 6 2
R3 1 1 3 4
R4 0 0 2 1
R5 1 1 5 0
", header = TRUE)
i
i$VQ7
i$VQ8
ACIHE = i$VQ7 + i$VQ8
ACIHE

The output is shown in the R console window.


Indicator calculation Fig. 1

Calculating correlations

We compute the Pearson correlation coefficient of the values recorded from the respondents for items Q7 and Q8. For this we define a table and compute the correlation between two of its columns, VQ7 and VQ8, considered as vectors in which we stored the values corresponding to the responses.

VQ7 = i$VQ7
VQ8 = i$VQ8

cor(VQ7, VQ8, method = "pearson")

Following the application of the calculation formulas and statistical functions, the following results were displayed in the R window:


Calculating correlation Fig. 2

The Pearson correlation coefficient is 0.91, which indicates a very good association between the level of education of the respondent and that of his/her family (a correlation coefficient greater than 0.75 or less than -0.75).
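A small helper of this kind can make the interpretation reproducible; the sketch below encodes the commonly cited Colton thresholds (0.25, 0.50, 0.75), so the exact labels are an assumption rather than a quotation of the author's code:

interpret_colton <- function(r) {
  a <- abs(r)
  if (a < 0.25) "little or no correlation"          # assumed Colton band
  else if (a < 0.50) "fair correlation"              # assumed Colton band
  else if (a < 0.75) "moderate to good correlation"  # assumed Colton band
  else "very good to excellent correlation"          # band used in the text (|r| > 0.75)
}
interpret_colton(0.91)   # "very good to excellent correlation"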

Displaying the results.

Student’s income in District 1 (RON) Fig. 3


Student Density Fig. 4


Number of students by educational profile (High School and Bachelor) in districts

Fig 5

CONCLUSIONS

The aim of this paper has been to provide a document that can be read and understood by students and their families, and that is also accessible to a larger audience. We must bear in mind that the target audience does not have deep statistical and IT knowledge, but is nevertheless very keen to know the results. We wanted to share this knowledge with the wider public and to have it used and further extended in future studies for a didactic purpose,


mostly because it has practical uses in fields such as economics, technical disciplines, social sciences and more. We strongly encourage the students to continue working in this area and to further develop this knowledge. The scope for developing such an application covers various areas, such as:
- didactic, to provide practical examples for these theoretical notions and concepts;
- organizational, so that both teachers and students may better organize their teaching/learning activities together;
- practical, for developing the skills and abilities that the students will need for successfully building up their future career paths;
- management, in promoting an authentic university quality management system;
- personal development, through self-knowledge and self-awareness of their own individual motivations and educational aspirations.

RECOMMENDATIONS AND DIRECTIONS FOR FURTHER DEVELOPMENT

We regard as useful and welcome the development of case study themes, either individual or in small working groups, leading to the build-up of scientific and research skills that are helpful in elaborating the undergraduate thesis and preparing the diploma dissertation.

References

1. Colton, T., Freedman, L. S. and Johnson, A. L., Statistics in Medicine, Vol. 1, No. 1, January 1982, New York, John Wiley and Sons.

2. Colton, T., Statistics in Medicine, Little Brown and Company, New York, NY, 1974.

3. R Core Team (2013), R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria.

4. Venables, W. N., Smith, D. M., R Core Team, An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics, 2012.

5. Murrell, P. (2005), R Graphics, Chapman & Hall/CRC Press.
6. http://cran.r-project.org/doc/manuals/R-intro.html
7. http://www.r-project.org/


ANNEX
Sample questionnaire

Q1. Please indicate your sex: F / M
Q2. Please indicate your year of birth ……………………
Q3. Do you have siblings? Yes/No; if yes, how many ………………
Q4. Ethnic origin ………………………………………………..
Q5. Marital status ……………………………………………………….
Q6. Family background ………………………………………………………..
Q7. Your level of education (highest level of education): A. Middle school B. High school graduation C. Bachelor's degree or master's degree
Q8. What is the highest level of education your parents have completed? If currently enrolled, highest degree received. A. No schooling completed B. Nursery school to 8th grade C. Some high school, no diploma D. High school graduate, diploma or the equivalent (for example: GED) E. Some college credits, no degree F. Trade/technical/vocational training G. Bachelor's degree H. Master's degree I. PhD degree
Q9. Grandparents' highest level of education …………………………………
Q10. Are you employed? A. full time B. part time C. unemployed
Q11. Employment contract: A. none B. temporary C. permanent
Q12. Does your time schedule allow you to be present in class during teaching/learning activities? A. weekends only B. partially C. completely
Q13. What kind of area were you raised in? A. rural B. small town C. suburban D. urban
Q14. Please report an estimate of your family's total monthly income: A. Under RON 2,000 B. Between RON 2,001 and 4,000 C. Between RON 4,001 and 6,000 D. Between RON 6,001 and 10,000 E. More than RON 10,000
Q15. What extra-curricular activities were available in your hometown during your years in school? Please list the ones you followed, if any ............................
Q16. Did your teaching institution in your hometown provide access to the following: A. library B. gym C. IT lab with internet D. up-to-date information on student grants, exchange programmes and other facilities
Q17. Did your home town have: A. Museums B. Cinemas C. Theatres D. Opera houses E. Stadiums


Integrating R and Hadoop for Big Data Analysis

Bogdan OANCEA ([email protected])
"Nicolae Titulescu" University of Bucharest
Raluca Mariana DRAGOESCU ([email protected])
The Bucharest University of Economic Studies

ABSTRACT

Analyzing and working with big data can be very difficult using classical means such as relational database management systems or desktop software packages for statistics and visualization. Instead, big data requires large clusters with hundreds or even thousands of computing nodes. Official statistics is increasingly considering big data for deriving new statistics because big data sources could produce more relevant and timely statistics than traditional sources. One of the software tools successfully and widely used for the storage and processing of big data sets on clusters of commodity hardware is Hadoop. The Hadoop framework contains libraries, a distributed file system (HDFS), a resource-management platform and an implementation of the MapReduce programming model for large-scale data processing. In this paper we investigate the possibilities of integrating Hadoop with R, a popular software package used for statistical computing and data visualization. We present three ways of integrating them: R with Streaming, Rhipe and RHadoop, and we emphasize the advantages and disadvantages of each solution.
Keywords: R, big data, Hadoop, Rhipe, RHadoop, Streaming
JEL Classification: L8, C88

1. INTRODUCTION

The big data revolution will transform the way we understand the surrounding economic and social processes. We can no longer ignore the enormous volume of data being produced every day. The term "big data" has been defined as data sets of increasing volume, velocity and variety (Mayer-Schönberger, 2012), (Beyer, 2011). Big data sizes range from a few hundred terabytes to many petabytes of data in a single data set. Such an amount of data is hard to manage and process with classical relational database management systems or statistics and visualization software packages; it requires high computing power and large storage devices. Official statistics needs to harness the potential of big data to derive more relevant and timely statistics, but this is not an easy process. The first step


is to identify the sources of big data that could be used in official statistics. According to (HLG, 2013), large data sources that can be used in official statistics are:
• administrative data;
• commercial or transactional data, such as on-line transactions using credit cards;
• data provided by sensors (satellite imaging, climate sensors, etc.);
• data provided by tracking devices (GPS, mobile devices, etc.);
• behavioral data (for example Internet searches);
• data provided by social media.

Using big data in official statistics raises several challenges (HLG, 2013). Among them we can mention: legislative issues, maintaining the privacy of the data, financial problems regarding the cost of sourcing data, data quality and suitability of statistical methods, and technological challenges. At this time there are several international initiatives that try to outline an action plan for using big data in official statistics: the Eurostat Task Force on Big Data and UNECE's Big Data HLG project. There are also some ongoing projects run by statistical agencies that have already used big data for developing new statistics. We can mention (HLG, 2013):
• traffic and transport statistics computed by Statistics Netherlands using traffic loop detection records generated every day; there are 10,000 detection loops on Dutch roads that produce 100 million records every day;
• social media statistics, also computed by Statistics Netherlands; Dutch Twitter produces around 1 million public social media messages on a daily basis, and these messages were analyzed from the perspective of content and sentiment;
• the software developed at Eurostat for price scraping from the Internet, to assist in computing the Consumer Price Index;
• the Billion Prices Project developed at MIT (http://bpp.mit.edu/), which collects prices from retailers around the world to conduct economic research;
• tourism statistics developed in Estonia by using mobile positioning data (Ahas, 2013).

In this paper we investigate a technological problem: we present a way of integrating Hadoop (White, 2012), a software framework for distributed computing used for big data processing, with R (R Core Team, 2013), a popular desktop statistics package. The paper is structured as follows: in section 2 we introduce R and Hadoop; next we show how these two software packages can be integrated in order to be used together for big data statistical analysis; and section 4 concludes our presentation.

2. R AND HADOOP - SOFTWARE TOOLS FOR LARGE DATA SETS STATISTICAL ANALYSIS

R is a free software package for statistics and data visualization. It is available for UNIX, Windows and MacOS platforms and is the result of the work of many programmers from around the world. R contains facilities for data handling, provides high-performance procedures for matrix computations, a large collection of tools for data analysis, graphical functions for data visualization and a straightforward programming language. R comes with about 25 standard packages and many more packages available for download through the CRAN family of Internet sites (http://CRAN.R-project.org). R is used as a computational platform for regular statistics production in many official statistics agencies (Todorov, 2010), (Todorov, 2012). Besides official statistics, it is used in many other sectors such as finance, retail, manufacturing and academic research, making it a popular tool among statisticians and researchers. Hadoop is a free software framework developed for the distributed processing of large data sets using clusters of commodity hardware, implementing simple programming models (White, 2013). It is a middleware platform that manages a cluster of computers. It was developed in Java, and although Java is the main programming language for Hadoop, other languages can be used as well: R, Python or Ruby. Hadoop is available at http://hadoop.apache.org/. One of the biggest users of Hadoop is Yahoo!. Yahoo! uses Hadoop for the Yahoo! Search Webmap, an application that runs on a very large cluster and produces data used in Yahoo! Web search queries (Yahoo! Developer Network, 2014). Another important Hadoop user is Facebook, which operated a Hadoop cluster with more than 100 PB of data in 2012 (Ryan, 2012). The Hadoop framework includes:

• Hadoop Distributed File System (HDFS) - a high performance distributed file system;

• Hadoop YARN which is a framework for job scheduling and cluster resource management;

• Hadoop MapReduce – a system for parallel processing of large data sets that implements the MapReduce model of distributed programming (Dean, 2004).


In brief, Hadoop provides reliable distributed storage through HDFS and an analysis system through MapReduce. It was designed to scale up from a few servers to hundreds or thousands of computers, with a high degree of fault tolerance. Hadoop is now a de-facto standard in big data processing and storage; it provides practically unlimited scalability and is supported by major vendors in the software industry. The Hadoop Distributed File System relies on a client/server architecture consisting of a single NameNode, implemented on a master server, that manages the file system namespace, and a number of DataNodes that manage the storage attached to the nodes. Files are split into blocks that are stored in a set of DataNodes. The NameNode is responsible for operations like opening, closing or renaming files, while DataNodes are responsible for responding to read or write requests from clients. MapReduce is a model for processing large sets of data in parallel on large clusters of computers. It splits the input data into chunks that are processed in parallel by the map tasks. The results of the map tasks are sorted and forwarded as inputs to the reduce tasks, which perform a summary operation (a minimal, purely local R illustration of this map-shuffle-reduce idea is sketched after the list below). The framework that implements the MapReduce paradigm should marshal the distributed servers, run tasks in parallel, manage the data transfers between the nodes of the cluster, and provide fault tolerance. Hadoop MapReduce hides the parallelism from the programmer, presenting him a simple model of computation. The main features of the Hadoop framework can be summarized as follows:
• high degree of scalability: new nodes can be added to a Hadoop cluster as needed, without changing data formats or the applications that run on top of the file system;
• cost effectiveness: it allows massively parallel computing on commodity hardware;
• flexibility: Hadoop differs from an RDBMS in being able to use any type of data, structured or not;
• fault tolerance: if a node fails for some reason, the system sends the job to another location of the data and continues processing.
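The following toy fragment is only a local R illustration of the map-shuffle-reduce idea described above (a word count computed entirely in memory); it does not involve Hadoop itself:

input <- c("big data in official statistics", "big data and R")
# map: split each line into words and emit (word, 1) pairs
mapped <- unlist(lapply(strsplit(input, " "),
                        function(words) setNames(rep(1, length(words)), words)))
# shuffle/sort: group the emitted values by key
shuffled <- split(unname(mapped), names(mapped))
# reduce: apply a summary operation (here, a sum) to each group
reduced <- sapply(shuffled, sum)
reduced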

Hadoop also has a series of limitations, which can be summarized as follows:

• HDFS is an append-only file system; it doesn't allow update operations;


• MapReduce jobs run in batch mode. That’s why Hadoop is not suited for interactive applications;

• Hadoop cannot be used in transactional applications.

Data analysts who work with Hadoop may have many R scripts or packages that they use for data processing. Using these scripts/packages with Hadoop normally requires rewriting them in Java or another language that implements MapReduce, which is cumbersome and could be a difficult and error-prone task. What we need is a way to connect Hadoop with R and use the software already written for R with the data stored in Hadoop (Holmes, 2012). Another reason for integrating R with Hadoop for large data set analysis is the way R works: it processes data loaded into main memory. Very large data sets (TB or PB) cannot be loaded into RAM, and for such data Hadoop integrated with R is one of the first-choice solutions. Although there are several solutions for using R in a high performance computing environment (snow, Rmpi or rsge), all of them require that the data be loaded in memory before being distributed to the computing nodes, and this is simply not possible for very large data sets.

3. R AND HADOOP INTEGRATION

We present three approaches to integrating R and Hadoop: R with Streaming, Rhipe and RHadoop. There are also other approaches; for example, RODBC/RJDBC could be used to access data from R, but a survey of the Internet shows that the most used approaches for linking R and Hadoop are Streaming, Rhipe (Cleveland, 2010) and RHadoop (Prajapati, 2013). The general structure of the analytics tools integrated with Hadoop can be viewed as the layered architecture presented in figure 1. The first layer is the hardware layer; it consists of a cluster of (commodity) computers. The second layer is the middleware layer, Hadoop; it manages the distribution of the files by using HDFS, and the MapReduce jobs. Then comes a layer that provides an interface for data analysis. At this level we can have a tool like Pig, which is a high-level platform for creating MapReduce programs using a language called Pig Latin. We can also have Hive, which is a data warehouse infrastructure developed by Apache and built on top of Hadoop. Hive provides facilities for running queries and data analysis using an SQL-like language called HiveQL, and it also provides support for implementing MapReduce tasks.


Besides these two tools, we can implement at this level an interface with other statistical software like R. We can use the Rhipe or RHadoop libraries, which build an interface between Hadoop and R, allowing users to access data from the Hadoop file system and write their own scripts for implementing map and reduce jobs, or we can use Streaming, which is a technology integrated in Hadoop. Hadoop can also be integrated with other statistical software such as SAS or SPSS.

Hadoop and data analysis tools                                            Figure 1

[Layered diagram: the hardware layer (Computer 1, Computer 2, ..., Computer n); the middleware layer (the Hadoop framework, with the Hadoop distributed file system and MapReduce); and the interface layer (Pig, Hive, Streaming, Rhipe and RHadoop, the last three connecting to R and other statistical software).]

We analyze these options for integrating R and Hadoop from different points of view: licensing, complexity of installation, benefits and limitations.

R AND STREAMING

Streaming is a technology integrated in the Hadoop distribution that allows users to run Map/Reduce jobs with any script or executable that reads data from standard input and writes the results to standard output as the mapper or reducer. This means that we can use Streaming together with R scripts in the map and/or reduce phase, since R can read and write data from/to standard input and output. In this approach there is no client-side integration with R, because the user uses the Hadoop command line to launch the Streaming job, with arguments specifying the mapper and reducer R scripts. A command line with map and reduce tasks implemented as R scripts would look like this:

An example of a map-reduce task with R and Hadoop integrated by the Streaming framework
                                                                          Figure 2
$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/contrib/streaming/*.jar \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -input input_data.txt \
    -output output \
    -mapper /home/tst/src/map.R \
    -reducer /home/tst/src/reduce.R \
    -file /home/tst/src/map.R \
    -file /home/tst/src/reduce.R

Here we assumed that the data in the file "input_data.txt" had already been copied from the local file system to HDFS. The meanings of the command line parameters are: "-inputformat org.apache.hadoop.mapred.TextInputFormat" specifies the input format of the job (we stored our input data in a text file); "-input input_data.txt" specifies the input data file of the job; "-output output" sets the output directory of the job; "-mapper /home/tst/src/map.R" specifies the map phase executable, in this example an R script named map.R located in the /home/tst/src/ directory; "-reducer /home/tst/src/reduce.R" specifies the reduce phase executable, in our example also an R script, named reduce.R, located in the /home/tst/src/ directory; "-file /home/tst/src/map.R" indicates that the R script map.R should be copied to the distributed cache and made available to the map tasks, and causes the map.R script to be transferred to the cluster machines where the map-reduce job will be run; "-file /home/tst/src/reduce.R" does the same for the reduce.R script.
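For illustration only, map.R and reduce.R could be plain R scripts that read from standard input and write key/value pairs to standard output; the word-count sketch below is our own minimal example of such scripts, not the code used by the authors:

#!/usr/bin/env Rscript
# map.R - reads text lines from stdin and emits "word<TAB>1" pairs
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  for (w in words[words != ""]) cat(w, "\t1\n", sep = "")
}
close(con)

#!/usr/bin/env Rscript
# reduce.R - Streaming delivers keys sorted, so counts are accumulated per contiguous key
con <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
  key <- parts[1]; val <- as.integer(parts[2])
  if (!is.null(current) && key != current) {
    cat(current, "\t", total, "\n", sep = "")
    total <- 0
  }
  current <- key
  total <- total + val
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)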


The integration of R and Hadoop using Streaming is easy, because the user only needs to run the Hadoop command line to launch the Streaming job, specifying the mapper and reducer scripts as command line arguments. This approach requires R to be installed on every DataNode of the Hadoop cluster, but this is a simple task. The licensing scheme needed for this approach implies an Apache 2.0 license for Hadoop and a combination of GPL-2 and GPL-3 for R.

RHIPE

Rhipe stands for "R and Hadoop Integrated Programming Environment" and is an open source project that provides a tight integration between R and Hadoop. It allows the user to carry out data analysis of big data directly in R, providing R users the same facilities of Hadoop that Java developers have. The software package is freely available for download at www.datadr.org. The installation of Rhipe is a somewhat difficult task: on each DataNode the user should install R, Protocol Buffers and Rhipe, which requires R to be built as a shared library on each node, the Google Protocol Buffers to be built and installed on each node, and Rhipe itself to be installed. The Protocol Buffers are needed for data serialization, increasing efficiency and providing interoperability with other languages. Rhipe is an R library which allows running a MapReduce job within R. The user writes specific native R map and reduce functions and Rhipe manages the rest: it transfers them and invokes them from the map and reduce tasks. The map and reduce inputs are transferred using a Protocol Buffer encoding scheme to a Rhipe C library, which uses R to call the map and reduce functions. The advantage of using Rhipe rather than the parallel R packages consists in its integration with Hadoop, which provides a data distribution scheme using the Hadoop distributed file system across a cluster of computers, tries to optimize processor usage and provides fault tolerance. The general structure of an R script that uses Rhipe is shown in figure 3, and one can easily note that writing such a script is very simple.


The structure of an R script using Rhipe                                  Figure 3

1  library(Rhipe)
2  rhinit(TRUE, TRUE)
3  map <- expression({ lapply(map.values, function(mapper) …) })
4  reduce <- expression(
5    pre    = {…},
6    reduce = {…},
7    post   = {…},
8  )
9  z <- rhmr(
10   map = map, reduce = reduce,
11   ifolder = inputPath,
12   ofolder = outputPath,
13   inout = c("text", "text"),
14   jobname = "a job name")
15 rhex(z)

The script begins by loading the Rhipe library into memory (line 1) and initializing it (line 2). Line 3 defines the map expression to be executed by the map tasks. Lines 4 to 8 define the reduce expression, which consists of three callbacks: the pre block (line 5) is called for each unique map output key before the corresponding values are sent to the reduce block; the reduce block (line 6) is then called with a vector of values as argument; and, at the end, the post block (line 7) is called to emit the output (key, value) pair. Line 9 shows the call of the rhmr function, which sets up the job (creates a MapReduce object), and the rhex function call (line 15) launches the MapReduce job on the Hadoop framework. Rhipe also provides functions to communicate with Hadoop during the MapReduce process, such as rhcollect, which writes data to Hadoop MapReduce, or rhstatus, which returns the status of a job. Rhipe lets the user focus on the data processing algorithms, while the difficulties of distributing data and computations across a cluster of computers are handled by the Rhipe library and Hadoop.
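As an illustration of how the placeholders in figure 3 might be filled in, the sketch below outlines a word count written in this style; it only uses the Rhipe functions mentioned in the text (rhinit, rhcollect, rhmr, rhex) and it assumes that Rhipe's conventional map.values, reduce.key and reduce.values objects are available inside the expressions, so it should be read as a sketch rather than tested code:

library(Rhipe)
rhinit(TRUE, TRUE)
map <- expression({
  # emit a (word, 1) pair for every word of every input line
  lapply(map.values, function(line) {
    for (w in unlist(strsplit(line, " "))) rhcollect(w, 1)
  })
})
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)
z <- rhmr(map = map, reduce = reduce, ifolder = "input", ofolder = "output",
          inout = c("text", "text"), jobname = "word count")
rhex(z)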


The licensing scheme needed for this approach implies an Apache 2.0 license for Hadoop and Rhipe and a combination of GPL-2 and GPL-3 for R.

RHADOOP

RHadoop is an open source project developed by Revolution Analytics (http://www.revolutionanalytics.com/) that provides client-side integration of R and Hadoop. It allows running MapReduce jobs within R, just like Rhipe, and consists of a collection of four R packages:
• plyrmr - plyr-like data processing for structured data, providing common data manipulation operations on very large data sets managed by Hadoop;
• rmr - a collection of functions providing an integration of R and the MapReduce model of computation;
• rhdfs - an interface between R and HDFS, providing file management operations within R;
• rhbase - an interface between R and HBase, providing database management functions for HBase within R.

Setting up RHadoop is not a complicated task, although RHadoop has dependencies on other R packages. Working with RHadoop requires installing R and the RHadoop packages with their dependencies on each DataNode of the Hadoop cluster. RHadoop has a wrapper R script, called from Streaming, that calls the user-defined map and reduce R functions. RHadoop works similarly to Rhipe, allowing the user to define the map and reduce operations. A script that uses RHadoop looks like this:

The structure of an R script using RHadoop                                Figure 4

1 library(rmr)
2 map <- function(k, v) { … }
3 reduce <- function(k, vv) { … }
4 mapreduce(input = "data.txt", output = "output",
            textinputformat = rawtextinputformat,
            map = map, reduce = reduce)


First, the rmr library is loaded into memory (line 1); then follows the definition of the map function, which receives a (key, value) pair as input (line 2). The reduce function (line 3) is called with a key and a list of values as arguments for each unique map key. Finally, the script sets up and runs the mapreduce job (line 4). It should be noted that rmr makes the client-side R environment available to the map and reduce functions. The licensing scheme needed for this approach implies an Apache 2.0 license for Hadoop and RHadoop and a combination of GPL-2 and GPL-3 for R.
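By way of illustration, the bodies of the map and reduce functions in figure 4 could be filled in as in the word-count sketch below; it assumes rmr's keyval() helper and the rawtextinputformat shown in the figure, and should be read as an untested sketch rather than as the authors' code:

library(rmr)
map <- function(k, v) {
  # v is a line of text: emit a (word, 1) pair for each word
  words <- unlist(strsplit(v, " "))
  lapply(words, function(w) keyval(w, 1))
}
reduce <- function(k, vv) {
  # vv is the list of counts collected for the word k
  keyval(k, sum(unlist(vv)))
}
mapreduce(input = "data.txt", output = "output",
          textinputformat = rawtextinputformat,
          map = map, reduce = reduce)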

4. CONCLUSIONS

Official statistics is increasingly considering big data for building new statistics because of its potential to produce more relevant and timely statistics than traditional data sources. One of the software tools successfully used for the storage and processing of big data sets on clusters of commodity hardware is Hadoop. In this paper we presented three ways of integrating R and Hadoop for processing large-scale data sets: R with Streaming, Rhipe and RHadoop. We have to mention that there are also other ways of integrating them, such as RODBC, RJDBC or RHive, but they have some limitations. Each of the approaches presented here has benefits and limitations. While using R with Streaming raises no problems regarding installation, Rhipe and RHadoop require some effort in order to set up the cluster. Client-side integration with R is strong for Rhipe and RHadoop and is missing for R with Streaming. Rhipe and RHadoop allow users to define and call their own map and reduce functions within R, while Streaming uses a command line approach where the map and reduce functions are passed as arguments. Regarding the licensing scheme, all three approaches require GPL-2 and GPL-3 for R and Apache 2.0 for Hadoop, Streaming, Rhipe and RHadoop. We have to mention that there are other alternatives for large-scale data analysis: Apache Mahout, Apache Hive, the commercial versions of R provided by Revolution Analytics, the Segue framework or ORCH, an Oracle connector for R, but Hadoop with R seems to be the most used approach. For simple MapReduce jobs the straightforward solution is Streaming, but this solution is limited to text-only input data files. For more complex jobs the solution should be Rhipe or RHadoop.


References

1. Ahas, R. and Tiru, M. (2013), Using mobile positioning data for tourism statistics: Sampling and data management issues, NTTS - Conferences on New Techniques and Technologies for Statistics, Brussels.

2. Beyer, M. (2011), "Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data", Gartner, available at http://www.gartner.com/newsroom/id/1731916, accessed on 25th March 2014.

3. Cleveland, W. S. and Guha, S. (2010), Computing environment for the statistical analysis of large and complex data, Doctoral Dissertation, Purdue University, West Lafayette.

4. Dean, J. and Ghemawat, S. (2004), "MapReduce: Simplified Data Processing on Large Clusters", available at http://static.googleusercontent.com/media/research.google.com/ro//archive/mapreduce-osdi04.pdf, accessed on 25th March 2014.

5. High-Level Group for the Modernisation of Statistical Production and Services (HLG) (2013), What does "big data" mean for official statistics?, UNECE, available at http://www1.unece.org/stat/platform/pages/viewpage.action?pageId=77170614, accessed on 25th March 2014.

6. Holmes, A. (2012), Hadoop in practice, Manning Publications, New Jersey.

7. Mayer-Schönberger, V. and Cukier, K. (2012), "Big Data: A Revolution That Transforms How we Work, Live, and Think", Houghton Mifflin Harcourt.

8. Prajapati, V. (2013), Big data analysis with R and Hadoop, Packt Publishing.

9. R Core Team (2013), An Introduction to R, available at http://www.r-project.org/, accessed on 25th March 2014.

10. Ryan, A. (2012), Under the Hood: Hadoop Distributed Filesystem reliability with Namenode and Avatarnode, available at http://www.facebook.com/notes/facebook-engineering/under-the-hood-hadoop-distributed-filesystem-reliability-with-namenode-and-avata/10150888759153920, last accessed on 25th March 2014.

11. Todorov, V. and Templ, M. (2012), R in the statistical office: Part 2, Development, policy, statistics and research branch working paper 1/2012, United Nations Industrial Development Organization.

12. Todorov, V. (2010), R in the statistical office: The UNIDO experience, Working Paper 03/2010, United Nations Industrial Development Organization, available at http://www.unido.org/fileadmin/user_media/Services/Research_and_Statistics/statistics/WP/WP_2010_03.pdf, accessed on 25th March 2014.

13. White, T. (2012), Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Media.

14. Yahoo! Developer Network (2014), Hadoop at Yahoo!, available at http://developer.yahoo.com/hadoop/, last accessed on 25th March 2014.


Methodological considerations on the size of the Coefficient of Intensity of Structural Changes (CISC)

Dr. Florin Marius PAVELESCU ([email protected])
Institute of National Economy

ABSTRACT

The paper brings arguments in favour of emphasizing the modeling factors of the Coefficient of Intensity of Structural Changes (CISC) in order to obtain a better interpretation of the significance of the respective method of measuring structural change. It also highlights the impact of the characteristic features of structural changes on the differentiation between the size of the CISC computed at the economic branch level and at the sectorial level, respectively. All possible situations of structural change from a sectorial point of view are identified. At the end of the paper, a numerical example related to the structural changes of Romania's employed population during the period 2008-2011 is presented. The above-mentioned example offers an opportunity to review all the steps necessary for the identification of the CISC modeling factors when the economic branch approach is considered, and a comparison with the CISC computed in a sectorial vision is made. The respective steps were carried out using R software.
Key words: transfer of weights, informational energy, main and secondary sense of sectorial structural change, intrasectorial structural changes
JEL Classification: C02, C18, O11

Economic development is, as a rule, accompanied by structural changes. Consequently, it is important to define and use methods of measurement which reveal the features of the respective changes. Among the methods built to attain the above-mentioned objective we may consider the Coefficient of Intensity of Structural Changes (CISC). CISC was used in papers such as E. Dobrescu (1968) and E. Dobrescu (2009) in order to analyze the behaviour of leading indicators during periods defined by fast economic growth or by ample transformations of the economic mechanism. It is important to note that the size of the CISC depends on the level of aggregation of the data, in other words on the number of economic branches or sectors considered. Therefore, it is very important to review some algebraic properties of the CISC and explain the relationship between the above-mentioned coefficient computed at the economic branch level and at the sectorial level, respectively.


1. ALGEBRAIC PROPERTIES OF COEFFICIENT OF INTENSITY OF STRUCTURAL CHANGES

Usually, the Coefficient of Intensity of Structural Changes (CISCr) is computed with the help of the formula:

CISC_r = \sqrt{\sum_{i=1}^{r} (g_{i1} - g_{i0})^2}   (1)

where:
r = the number of considered economic branches
g_{i1}, g_{i0} = the weight of economic branch i in the analyzed indicator in year 1 and year 0, respectively.

We may observe that there are economic branches which register an increase of their weights, while other economic branches experience a decrease of their weights. Because

\sum_{i=1}^{n} g_{i0} = \sum_{i=1}^{n} g_{i1} = 1,

the sum of the gains in weights is equal to the sum of the absolute values of the losses in weights. We may define the sum of the gains registered by the weights of some economic branches during the analyzed period as the "transfer of weights" (Twr). If we divide the considered economic branches into "winners" and "losers" of weights, we are able to write the formula of CISCr as:

CISC_r = Tw_r \cdot \sqrt{\sum_{j=1}^{p} \frac{(g_{j1} - g_{j0})^2}{Tw_r^2} + \sum_{k=1}^{q} \frac{(g_{k1} - g_{k0})^2}{Tw_r^2}}   (2)

where:
p = the number of economic branches which register gains in their weights
q = the number of economic branches which register losses in their weights


We may notice that the expressions

IE_p = \sum_{j=1}^{p} \frac{(g_{j1} - g_{j0})^2}{Tw_r^2}   (3)

and

IE_q = \sum_{k=1}^{q} \frac{(g_{k1} - g_{k0})^2}{Tw_r^2}   (4)

are in fact informational energies, if we consider O. Onicescu and M. Botez (1985).

The maximum value of IE_p is equal to 1 and occurs if only one of the differences (g_{j1} - g_{j0}) is equal to Tw_r, the rest of the respective differences being equal to 0. Analogously, the maximum value of IE_q is 1 and is observed when only one of the differences (g_{k1} - g_{k0}) is equal to Tw_r, the rest of the respective differences being equal to 0.

Consequently, we may write:

CISC_r = Tw_r \cdot \sqrt{IE_p + IE_q}   (5)

The maximum value of the expression \sqrt{IE_p + IE_q} is \sqrt{2}.

The minimum value of IE_p is equal to 1/p and is obtained when all the differences (g_{j1} - g_{j0}) are equal to Tw_r/p. Analogously, the minimum value of IE_q is equal to 1/q and is obtained when all the differences (g_{k1} - g_{k0}) are equal to Tw_r/q. Because r = p + q, the absolute minimum of the expression \sqrt{IE_p + IE_q} is obtained if p = q and is equal to \sqrt{4/r}, equivalent to \sqrt{2/p}.

If r = 2p - 1, the absolute minimum of the expression \sqrt{IE_p + IE_q} is equal to \sqrt{\frac{2p - 1}{p \cdot (p - 1)}}, equivalent to \sqrt{\frac{r}{p \cdot (r - p)}}.

The ratio between the minimum and the maximum value of the expression \sqrt{IE_p + IE_q}, denoted RIE_{min/max}, is equal to \sqrt{\frac{r}{2 \cdot p \cdot (r - p)}}. If p = q, we have

RIE_{min/max} = \frac{1}{\sqrt{p}}   (6)


In these conditions, we may write:

$$CISCr = Twr \cdot \sqrt{2} \cdot \sqrt{\frac{IEp + IEq}{2}} \qquad (7)$$

The expression

$$DCSCr = \sqrt{\frac{IEp + IEq}{2}} \qquad (8)$$

may be considered the degree of concentration of structural changes in the context of an economy with r branches, because it represents the ratio between the registered concentration of the structural changes, both from the point of view of gaining and of losing relative importance, and the maximum value of the respective concentration. In other words, we may express CISCr as:

$$CISCr = Twr \cdot \sqrt{2} \cdot DCSCr \qquad (9)$$
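As a compact illustration of formulas (1)–(9), the decomposition can be wrapped in a small R function. This is a minimal sketch with an illustrative function name and made-up weight vectors; the computation on the actual survey data is given in Annex no. 1.

# Minimal sketch of the CISC decomposition (formulas 1, 3, 4, 8 and 9)
# g1, g0 = vectors of weights (in %) in year 1 and year 0
cisc_decomposition <- function(g1, g0) {
  d    <- g1 - g0                       # differences of weights
  CISC <- sqrt(sum(d^2))                # formula (1)
  Tw   <- sum(d[d > 0])                 # transfer of weights
  IEp  <- sum(d[d > 0]^2) / Tw^2        # formula (3)
  IEq  <- sum(d[d < 0]^2) / Tw^2        # formula (4)
  DCSC <- sqrt((IEp + IEq) / 2)         # formula (8)
  c(CISC = CISC, Tw = Tw, IEp = IEp, IEq = IEq, DCSC = DCSC,
    CISC_check = Tw * sqrt(2) * DCSC)   # formula (9), equal to CISC
}

# Illustrative weights (in %) for four hypothetical branches
cisc_decomposition(g1 = c(28, 20, 30, 22), g0 = c(25, 24, 29, 22))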

2. PARTICULAR FEATURES OF CISC COMPUTED IN CONDITIONS OF A TRISECTORIAL VISION

If the vision initiated by Colin Clark (1960) and consecrated in Y. Sabolo, I. Gaude and R. Wery (1974) is adopted, economic branches may be grouped into three economic sectors, i.e. the primary sector (agriculture, forestry, hunting and fishing), the secondary (industrial) sector (mining and quarrying, manufacturing, electricity, gas, steam and air conditioning production and supply) and the tertiary sector (services). In these conditions, it is possible to identify six types of sectorial structural changes, respectively:
a) re-agrarization, if the weight of the primary sector increased during the analyzed period;
b) de-agrarization, if the weight of the primary sector decreased during the analyzed period;
c) re-industrialization, if the weight of the secondary sector increased during the analyzed period;
d) de-industrialization, if the weight of the secondary sector decreased during the analyzed period;
e) tertialization, if the weight of the tertiary sector increased during the analyzed period;
f) de-tertialization, if the weight of the tertiary sector decreased during the analyzed period.

If only three sectors are considered, the computation of the coefficient of intensity of sectorial structural changes (CISCs) is a particular case of CISC. In order to compute CISCs, the following formula may be used:

$$CISCs = Tws \cdot \sqrt{IEu + IEv} \qquad (10)$$

equivalent with:

$$CISCs = Tws \cdot \sqrt{2} \cdot DCSCs \qquad (11)$$

where Tws, IEu, IEv and DCSCs have significance analogous to the case of CISCr. Also, note that u = the number of sectors experiencing gains in their relative importance and v = the number of sectors experiencing losses in their relative importance.

The minimum value of CISCs in conditions of a fixed transfer of weights is $CISCs_{min} = \sqrt{3/2} \cdot Tws$. The maximum value of the above-mentioned indicator in conditions of a fixed transfer of weights is $CISCs_{max} = \sqrt{2} \cdot Tws$; because we deal with only three sectors, the transfer of weights is then entirely located in one of the considered sectors.

Also, we are able to rank the sectors by the absolute value of the changes of weights registered in the analyzed period and determine the main and secondary sense of the structural changes (a small R sketch of this ranking is given after the list below). We consider as the main sense of structural change the type of change which occurred in the sector where the modification of weight is maximum in absolute value. The secondary sense of structural change is the type of change that happened in the sector where the modification of weight is the second largest in absolute value. It is possible to identify twelve situations from the point of view of the senses of sectorial structural changes, respectively:
A) Main sense = re-agrarization, secondary sense = de-industrialization
B) Main sense = re-agrarization, secondary sense = de-tertialization
C) Main sense = de-agrarization, secondary sense = re-industrialization
D) Main sense = de-agrarization, secondary sense = tertialization
E) Main sense = re-industrialization, secondary sense = de-agrarization
F) Main sense = re-industrialization, secondary sense = de-tertialization
G) Main sense = de-industrialization, secondary sense = re-agrarization
H) Main sense = de-industrialization, secondary sense = tertialization
I) Main sense = tertialization, secondary sense = de-agrarization
J) Main sense = tertialization, secondary sense = de-industrialization
K) Main sense = de-tertialization, secondary sense = re-agrarization
L) Main sense = de-tertialization, secondary sense = re-industrialization
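The ranking of sectors by the absolute change of their weights, mentioned above, can be sketched in a few lines of R; the helper name below is illustrative, and the numeric input reproduces the sectoral differences from Table no. 2 in Section 4.

# Minimal sketch: rank the three sectoral weight changes and label the main and secondary senses
# dif_gs = vector of (gs1 - gs0) for the primary, secondary and tertiary sectors
sense_of_change <- function(dif_gs) {
  labels <- c(ifelse(dif_gs[1] > 0, "re-agrarization",      "de-agrarization"),
              ifelse(dif_gs[2] > 0, "re-industrialization", "de-industrialization"),
              ifelse(dif_gs[3] > 0, "tertialization",       "de-tertialization"))
  ord <- order(abs(dif_gs), decreasing = TRUE)
  c(main = labels[ord[1]], secondary = labels[ord[2]])
}

# Sectoral differences from Table no. 2 (1.67, -2.29, 0.62):
sense_of_change(c(1.67, -2.29, 0.62))
# main = "de-industrialization", secondary = "re-agrarization"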


It should be noted that in practice the main sense of sectorial structural changes is determined by the features of the stages of development of the analyzed economy. Therefore, if we consider the hypotheses presented in J. Fourastié (1989) and A. Toffler (1980), in the long run the main sense of structural change in general, and especially in the case of the employed population, is de-agrarization during the period of building the base industrial structure¹ and tertialization during the transition to a post-industrial society².

3. IDENTIFICATION OF MODELING FACTORS OF DIFFERENTIATION OF THE COEFFICIENTS OF INTENSITY OF STRUCTURAL CHANGES AT SECTORIAL AND ECONOMIC BRANCH LEVEL

The Coefficient of Intensity of Structural Changes computed at sectorial level (CISCs) differs from the CISC computed at the level of economic branches (CISCr). The respective differentiation is determined not only by the considered level of aggregation of the data, but also by the features of the intrasectorial structural changes. If we consider the economic branches grouped within the three sectors, CISCr may be written as:

$$CISCr = \sqrt{\sum_{s=1}^{3} (g_{s1} - g_{s0})^2} \cdot \frac{\sqrt{\sum_{s=1}^{3} \left( \sum_{m} \left| g_{ms1} - g_{ms0} \right| \right)^2}}{\sqrt{\sum_{s=1}^{3} (g_{s1} - g_{s0})^2}} \cdot \frac{\sqrt{\sum_{i=1}^{r} (g_{i1} - g_{i0})^2}}{\sqrt{\sum_{s=1}^{3} \left( \sum_{m} \left| g_{ms1} - g_{ms0} \right| \right)^2}} \qquad (12)$$

where: $g_{s1}$, $g_{s0}$ = the weights registered by sector s in year 1 and year 0, respectively; $g_{ms1}$, $g_{ms0}$ = the weights registered by economic branch m, grouped within sector s, in year 1 and year 0, respectively; $\left| g_{ms1} - g_{ms0} \right|$ = the absolute value of the difference between the weights registered by economic branch m, grouped within sector s, in year 1 and year 0.

¹ The period of building the base industrial structure is considered by A. Toffler to be the second wave of the development of economy and society. The above-mentioned author considered that the second wave of economic and social development of mankind began with the industrial revolution and ended during the 1950s in the most developed western countries, being characterized by the tendency to create and develop mass production within industrial firms. During the second wave period an important transfer of population from rural areas to urban areas also took place, and the respective structural change was replicated in the other countries during the periods when their industrial base was created and developed.
² According to Toffler, the Third Wave of economic and social development of mankind became manifest in the most developed market economies during the late 1950s. It is a period of transition to a post-industrial society, in which the generation and use of information and communication technologies plays an ever bigger role in economic activities. Consequently, the de-massification of industrial activities and the decentralization of the decisions taken by economic and social actors are stimulated. In these conditions, the services sector supplies the most important part of the jobs, while the weights of the primary and secondary sectors in the employed population constantly decrease.

In these conditions, we may define the index of intrasectorial structural change (IIaS) by using the formula:

$$IIaS = \frac{\sqrt{\sum_{s=1}^{3} \left( \sum_{m} \left| g_{ms1} - g_{ms0} \right| \right)^2}}{\sqrt{\sum_{s=1}^{3} (g_{s1} - g_{s0})^2}} \qquad (13)$$

Note that there are two situations detected by IIaS, namely:
a) when all the structural changes registered at the level of branches are in accordance with the sense of the structural changes registered at sectorial level; in this case IIaS = 1;
b) when the sense of at least one of the structural changes registered at the level of branches is in contradiction with the sense of the structural change registered at sectorial level; in this case IIaS > 1.

Also, we may define the index of concentration of structural changes within sectors (ICsect) by considering the formula:

$$ICsect = \frac{\sqrt{\sum_{i=1}^{r} (g_{i1} - g_{i0})^2}}{\sqrt{\sum_{s=1}^{3} \left( \sum_{m} \left| g_{ms1} - g_{ms0} \right| \right)^2}} \qquad (14)$$

If we consider IIaS and ICsect, we may also compute CISCr by using the formula:

$$CISCr = CISCs \cdot IIaS \cdot ICsect \qquad (15)$$
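Formula (15) can be verified numerically in a few lines of R. The sketch below mirrors the quantities computed in Annex no. 1 and uses the branch-level differences from Table no. 1 in Section 4, with the first branch forming the primary sector, branches 2-5 the secondary sector and branches 6-10 the tertiary sector; the object names are illustrative, and the values of IIaS and ICsect will differ slightly from those reported in the text because the table weights are rounded to two decimals.

# Minimal sketch: check that CISCr = CISCs * IIaS * ICsect (formula 15)
dif_gi <- c(1.67, -0.15, -1.46, -0.07, -0.61, 0.27, 0.48, 0.31, -0.62, 0.18)  # Table no. 1
sector <- rep(1:3, c(1, 4, 5))          # sector membership of the 10 branches
dif_gs <- tapply(dif_gi, sector, sum)   # sectoral differences (Table no. 2)

CISCr  <- sqrt(sum(dif_gi^2))
CISCs  <- sqrt(sum(dif_gs^2))
sq_abs <- sqrt(sum(tapply(abs(dif_gi), sector, sum)^2))

IIaS   <- sq_abs / CISCs     # formula (13)
ICsect <- CISCr / sq_abs     # formula (14)

all.equal(CISCr, CISCs * IIaS * ICsect)   # TRUE, as stated by formula (15)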


4. A NUMERICAL EXAMPLE. COMPUTATION OF CISC IN CASE OF EMPLOYED POPULATION AT ECONOMIC BRANCH AND SECTORIAL LEVEL DURING THE PERIOD 2008-2011

In order to illustrate the proposed improvement of the methodology for the interpretation of the Coefficient of Intensity of Structural Changes, the modeling factors of the differentiation of CISCs and CISCr were identified in the case of the employed population during the period 2008-2011, considering data from Romania's Statistical Yearbook for 2012. A number of 10 economic branches are taken into account, grouped into three sectors, namely:
a) the primary sector, with a single economic branch, namely Agriculture, forestry and fishing;
b) the secondary (industrial) sector, with four economic branches, namely: 1) Mining and quarrying, 2) Manufacturing, 3) Energy, gas and water production and supply and waste management, 4) Constructions;
c) the tertiary (services) sector, with five economic branches, namely: 1) Wholesale and retail, repair of motor vehicles, hotels and restaurants, 2) Transport, storage, information and communication activities, 3) Financial intermediation, insurance, real estate activities, professional, scientific and technical activities, 4) Social infrastructure services (public administration, education, health), 5) Shows, culture and recreation activities and other service activities.
The structural changes of the employed population during the period 2008-2011 at the level of economic branches and sectors are shown in Table no. 1 and Table no. 2, respectively.


Structural changes of employed population at the level of economic branches in Romania during the period 2008-2011
Table no. 1 (%)

Economic branch                                        Weight in 2008   Weight in 2011   Difference of weights
Economy as a whole                                             100.00           100.00                    0.00
Agriculture, forestry, fisheries                                27.52            29.19                    1.67
Mining and quarrying                                             0.93             0.78                   -0.15
Manufacturing                                                   19.33            17.87                   -1.46
Energy, gas, water                                               2.39             2.32                   -0.07
Constructions                                                    7.91             7.30                   -0.61
Wholesale and retail, hotels and restaurants                    15.21            15.48                    0.27
Transportation, communications                                   6.33             6.81                    0.48
Financial intermediation and professional services               6.25             6.56                    0.31
Social services                                                 12.05            11.43                   -0.62
Culture and recreation services and other services               2.08             2.26                    0.18

Structural changes of employed population at sectorial level in Romania during the period 2008-2011
Table no. 2 (%)

Sector               Weight in 2008   Weight in 2011   Difference of weights
Economy as a whole           100.00           100.00                    0.00
Primary sector                27.52            29.19                    1.67
Secondary sector              30.56            28.27                   -2.29
Tertiary sector               41.92            42.54                    0.62

The computation of the indicators of the proposed methodology for CISC at economic branch and sectorial level with the help of R software (Annex no. 1) led to the following results:

CISCr = 2.48%, Twr = 2.91%, IEp = 0.3806, IEq = 0.3448, DCSCr = 0.6021
CISCs = 2.90%, Tws = 2.29%, IEu = 0.6058, IEv = 1.000, DCSCs = 0.8960

We may observe that the size of CISCr is relatively small, i.e. 2.48%. The respective indicator is obtained in conditions of a transfer of weights of 2.91% and of a moderate degree of concentration of structural changes, 0.6021. CISCs equals 2.90%. During the analyzed period, the main sense of sectorial structural change was de-industrialization, while the secondary sense was re-agrarization. At first sight, the sense of sectorial structural change appears to be in contradiction with the trend of the long-run mutations of the employment model during the transition to a service economy. However, the respective structural change of the employed population should be seen in relation to the situation of Romania's economy during the analyzed period. It is important to note that during 2009-2010 Romania faced an economic recession, which led to the loss of jobs especially in the secondary (industrial) sector. At the same time, because the demand for labour decreased relatively slowly in the services sector, its weight in the total employed population increased. During the analyzed period the number of persons employed in the primary sector also registered an increase. Therefore, the secondary sense of the employment structural change was re-agrarization.

The size of CISCs is greater than the size of CISCr, although the sectorial transfer of weights (2.29%) is smaller than the transfer of weights registered when the economic branches are considered (2.91%). The explanation of this situation is the noticeably higher degree of concentration of the sectorial structural change in comparison with the situation registered at economic branch level. This explanation is confirmed by the computation of IIaS and ICsect: we obtained IIaS = 1.1692 and ICsect = 0.7309. In these conditions, the ratio between CISCr and CISCs equals 0.8546. It should be noted that the fact that IIaS is greater than 1 is a consequence of the fact that the sense of change of the relative importance registered by social services is in contradiction with the sense of change of the relative importance registered by the tertiary sector as a whole in respect of the employed population.

References

1. C. Clark – Les conditions du progrès économique, PUF, Paris, 1960
2. E. Dobrescu – Ritmul creşterii economice, Editura Politică, Bucureşti, 1968 (The Rate of Economic Growth, Political Publishing House, Bucharest, 1968)
3. E. Dobrescu – Measuring the Interaction of Structural Changes with Inflation, Romanian Journal for Economic Forecasting, no. 6/2009
4. J. Fourastié – Le Grand Espoir du XXe siècle. Progrès technique, progrès économique, progrès social, édition revue et mise à jour, Tel Gallimard, Paris, 1989
5. O. Onicescu, M. Botez – Incertitudine şi modelare economică, Editura Ştiinţifică şi Enciclopedică, Bucureşti, 1985 (Incertitude and Economic Modeling, Scientific and Encyclopedic Publishing House, Bucharest, 1985)


6. F. M. Pavelescu – Progresul tehnologic şi ocuparea forţei de muncă, Editura IRLI, Bucureşti, 1997 (Technological Progress and Employment, IRLI Publishing House, Bucharest, 1997)
7. F. M. Pavelescu – Transformarea economiei şi dezechilibrele pieţei forţei de muncă, Editura IRLI, Bucureşti, 2003 (Transformation of Economy and Labour Market Disequilibria, IRLI Publishing House, Bucharest, 2003)
8. F. M. Pavelescu – Remodelarea aparatului productiv şi evoluţia structurii populaţiei ocupate, Centrul pentru Informare şi Documentare Economică, Bucureşti, Colecţia "Biblioteca Economică", seria "Probleme economice" nr. 270-271/2007 (Reshaping the Productive Apparatus and the Evolution of the Employed Population Structure, Center for Economic Information and Documentation, Bucharest, "Economic Library" Collection, "Economic Problems" Series no. 270-271/2007)
9. A. Toffler – The Third Wave, Bantam Books, 1980
10. Y. Sabolo, I. Gaude, R. Wery – Les tertiaires. Analyse comparative de la croissance de l'emploi dans les activités tertiaires, BIT, Genève, 1974

Annex no.1

The R instructions used to compute CISCr and CISCs and their modeling factors

# Import data
ramuri2008 <- read.table(file.choose())
ramuri2011 <- read.table(file.choose())

sectoare2008 <- read.table(file.choose())
sectoare2011 <- read.table(file.choose())

## Computation of CISCr
# Computation of the weights of economic branches
gi0 <- round((ramuri2008$V1/sum(ramuri2008$V1)*100), 4)
gi1 <- round((ramuri2011$V1/sum(ramuri2011$V1)*100), 4)

# Computation of the differences gi1-gi0
dif_gi <- gi1 - gi0

## Computation of CISCr with the classical method (CISCr1)
CISCr1 <- round((sum(dif_gi^2)^0.5), 4)

## Computation of CISCr with the proposed method
# A) Selection of positive values of dif_gi
dif_poz_gj <- dif_gi[dif_gi > 0 & !is.nan(dif_gi)]


# B) Selection of negative values of dif_gi
dif_neg_gk <- dif_gi[dif_gi < 0 & !is.nan(dif_gi)]

# C) Computation of the transfer of weights considering r economic branches (Twr)
Twr <- sum(dif_poz_gj)

# D) Computation of the informational energy of positive dif_gj (IEp)
IEp <- round((sum(dif_poz_gj^2)/Twr^2), 4)

# E) Computation of the informational energy of negative dif_gk (IEq)
IEq <- round((sum(dif_neg_gk^2)/Twr^2), 4)

# F) Computation of the degree of concentration of structural change considering r economic branches (DCSCr)
DCSCr <- round((((IEp+IEq)/2)^0.5), 4)

# G) Computation of CISCr with the formula issued from the proposed methodology (CISCr2)
CISCr2 <- round((Twr*2^0.5*DCSCr), 4)

## Computation of CISCs

# Computation of sectorial weights
gs0 <- round((sectoare2008$V1/sum(sectoare2008$V1)*100), 4)
gs1 <- round((sectoare2011$V1/sum(sectoare2011$V1)*100), 4)

# Computation of the differences gs1-gs0
dif_gs <- gs1 - gs0

## Computation of CISCs with the classical method (CISCs1)
CISCs1 <- round((sum(dif_gs^2)^0.5), 4)

## Computation of CISCs with the proposed method (CISCs2)
# A) Selection of positive values of dif_gs
dif_poz_gu <- dif_gs[dif_gs > 0 & !is.nan(dif_gs)]

# B) Selection of negative values of dif_gs
dif_neg_gv <- dif_gs[dif_gs < 0 & !is.nan(dif_gs)]


# C) Computation of the sectorial transfer of weights (Tws)
Tws <- sum(dif_poz_gu)

# D) Computation of the informational energy of positive dif_gu (IEu)
IEu <- round((sum(dif_poz_gu^2)/Tws^2), 4)

# E) Computation of the informational energy of negative dif_gv (IEv)
IEv <- round((sum(dif_neg_gv^2)/Tws^2), 4)

# F) Computation of the degree of concentration of sectorial structural change (DCSCs)
DCSCs <- round((((IEu+IEv)/2)^0.5), 4)

# G) Computation of CISCs with the formula issued from the proposed methodology (CISCs2)
CISCs2 <- round((Tws*2^0.5*DCSCs), 4)

## Identification of the modeling factors of the differentiation of the Coefficients of Intensity of Structural Changes at sectorial and economic branch level

# A) Computation of the sum of absolute values of the differences (gi1-gi0) within sectors
dif_abs_s1 <- abs(dif_gi[1])
dif_abs_s2 <- sum(abs(dif_gi[2:5]))
dif_abs_s3 <- sum(abs(dif_gi[6:10]))

# B) Computation of sqabs_uv
sqabs_uv <- (sum(dif_abs_s1^2, dif_abs_s2^2, dif_abs_s3^2)^0.5)

# C) Computation of IIaS
IIas <- sqabs_uv/CISCs2

# D) Computation of ICsect
ICsect <- CISCr2/sqabs_uv

# E) Computation of the ratio between CISCr and CISCs (ratio_CISCr_CISCs)
ratio_CISCr_CISCs <- CISCr2/CISCs2


Using R To Get Value Out Of Public Data

PhD Candidate Marius RADU
PhD Assistant Ioana MUREŞAN
PhD Professor Răzvan NISTOR
Babeş-Bolyai University, Faculty of Economics and Business Administration

ABSTRACT

Public sector information contains great value for citizens in general. Data stored on the computers of public institutions doesn't have value on its own. It has to be processed and analyzed to obtain information, and further on, information should be made available as a public good, in order to facilitate its transformation into knowledge. R is a free software programming language, an environment and a toolkit of modules addressed to anyone working with statistics. R can ease the road from public data to civic wisdom. This article is a brief review of R capabilities to extract, transform, analyze, and visualize public data. The second part of the article presents an example of a fully-fledged web application written entirely in R. The application uses loosely structured government data about the Romanian Auto Park in order to present it in a friendly dashboard.
Key Words: Open Data, R programming language, Reporting Web Applications

INTRODUCTION

The data revolution moves forward and the public data initiative is part of this movement. Having a strategic and informed view on public data is in everyone's interest. Open data initiatives aim to make communities more functional, sustainable and effective. There are hundreds of specialty forums, blogs and professional groups with discussions about big data, the future of NoSQL, and engineering approaches that can cope with the new structure. The public data topic is approached more and more frequently. On the other side, we rarely see applications, results or solutions for real problems that use open data for the public good.

Open Data is considered to be included as part of the larger concept of Open Government. In a broad understanding, open data is both "technically open" and "legally open". Technically open means available in a machine-readable standard format, while legally open means that the data is explicitly licensed in a way that permits commercial and non-commercial use and re-use without restrictions [1]. Between "public data" and "open data" there is a thin delimitation, and in the present article we will consider these two concepts as covering the same thing.

In this paper we will see two main approaches of using R towards open data:
1) The analytical process. For the first approach we will use, from the EU public data portal, a set of open data regarding the generation of waste in different countries. The paper will give an example regarding municipal waste in Egypt. The example is created to illustrate R capabilities with respect to open data and not necessarily to reveal outstanding facts about the environment.
2) The application construction, to share knowledge from open data. For the second approach we will use open data regarding the Romanian Auto Park for 2013. We will use these data from the Romanian government data portal to create, with a few lines of R code, a web reporting application.

The focus of this article is placed on the different ways in which R can help to solve problems with open data; our intention is neither to present a certain case and problem in depth, nor to present the whole spectrum of R packages and the complexity of problems that can be solved.

PUBLIC DATA AND THE R POTENTIAL

If we consider only the Romanian public data, there are hundreds of data sets available at Data.gov.ro, or in city data catalogs like http://data.e-primariaclujnapoca.ro/. A big challenge is to make these data add insight or utility to citizens' everyday lives. R can do this with simplicity and at no monetary cost. Using R for public data exploration is a meaningful opportunity to tell stories that are relevant to a region and its individuals. R is a tremendously useful tool in many ways when it comes to converting open data into information that people can distill into knowledge and insight.

Behind all the positive aspects of open data there are hurdles and risks which complicate the extraction of value from the data. Martin and Foulonneau identify seven categories of risks to Open Data initiatives: governance, economic issues, licenses and legal frameworks, data characteristics, metadata, access, and skills [2]. On their side, government agencies, municipalities and other public entities should address many challenges related to the infrastructure that makes data available to the public. In general, one of the important barriers to the development of data-centric public goods is the ability of public organizations to store and make use of the data. Saying this, we think about organizational and economic abilities, not necessarily about technical know-how. Aside from business tools for data analysis, there are plenty of other open-source tools aimed at data analytics in general, which can be used for public data acquisition, cleaning, analysis, modeling and visualization. To mention just a few of them: Jasper Reports and Pentaho for reporting; Tableau Public, Google Charts and Google Fusion Tables for data visualization; Octave and Python with Numpy, Scipy and scikit-learn for statistics, machine learning and analytical needs. In this open-source software toolbox, R can be considered a powerful tool suitable for the entire stack of statistical and analytical problems.

Governments have a large amount of data with unknown value, until we attempt to find its value. Considering the technical efforts, from a high-level point of view, open data should be available in three forms:
1. As an application programming interface (API) and in formats that allow a user to query and subset data, such as json or xml. These are for developers and programmers;
2. As downloadable files: structured, standardized data in highly utilized machine-readable formats (csv, kml, xml, and even xls) for researchers, journalists and students;
3. Ready presented for citizens who are looking for information.

DATA-DRIVEN JOURNALISM – AN EXAMPLE OF USING R WITH OPEN DATA

Tim Berners-Lee, the inventor of the World Wide Web, considers that analyzing data is the future for journalists. Public data in this context is the substance that can support information about the efficiency of policies, community management, or city evolution. Further on, Tim Berners-Lee asked an interesting and intriguing question: "Who's really going to hold the government, or anyone else, accountable?" [3]. Journalists today have a larger free toolbox available to find stories. Many journalists are already accustomed to using ready-made stats, databases and spreadsheets; moreover, seasoned journalists use powerful scripts written in languages like Python, Ruby and R for scraping data from the web. Amy Schmitz Weiss from San Diego State University observes that we are now entering the age of the "Digital Media Data Guru" – this guru is a person with a hybrid of computer science and journalism skills who is able to "do it all" in the newsroom [4]. Trained data journalists can use R to analyze huge datasets that exceed the limits of Excel, for instance a table with a million rows. R is often used as a scripting language for file management, but especially for data extract-transform-load (ETL) processes. A researcher or journalist can run script files like the one in Figure 1 below, or they can run simple command lines to process streams of data using R scripts. An example on Linux is:

cat mathscoresInputFile.csv | Rscript -e 'quantile(as.numeric(readLines("stdin")))' >> resultsOutputFile.csv

Figure 1

Web scraping, or web data extraction, is a software technique for extracting information from websites. For example, a tech-savvy journalist may find it useful to use an R script like the one below to scrape the www.monster.ie website for jobs of interest containing the words "R" and "journalism". Data regarding job descriptions obtained with this method can be further analyzed to depict the most frequently requested skills or to find semantic relationships between the words used in descriptions. R has an entire set of packages for text mining and analysis, e.g. tm, RWeka, etc.


rm(list=ls())

library(XML)
library(plyr)
require(RCurl)

setwd("E:/WORK_2014/ScrapingStuff")
urls <- c("http://jobsearch.monster.ie/jobs/?q=jounalism-R&cy=ie",
          "http://jobsearch.monster.ie/jobs/?q=jounalism-R&pg=2&cy=ie")
for (u in urls) {
  web_page <- readLines(u)
  # Pull out the appropriate lines
  jobs_lines <- web_page[grep("slJobTitle", web_page)]
  jobs_lines <- jobs_lines[grep(".aspx", jobs_lines)]
  code <- strsplit(as.character(jobs_lines), '"', fixed = TRUE)
  vect1 <- matrix(NA, ncol = 2, nrow = length(jobs_lines))
  for (i in seq(length(jobs_lines))) {
    vect1[i, 1] <- code[[i]][15]
    vect1[i, 2] <- code[[i]][14]
  }
  vect1 <- as.data.frame(vect1)
  vect1
  write.table(vect1, "jobsJournalismandR.tab", append = TRUE, sep = "\t",
              row.names = FALSE, col.names = FALSE)
}

Open data is readily approachable using the R programming language. Why R? Because:
• R is free, distributed under the terms of the GNU General Public License version 2.
• R has a large core statistical analysis toolkit and access to powerful and cutting-edge analytics libraries.
• R is a language, and analysis is done by writing functions and scripts. R is an interactive language; it promotes experimentation and exploration.
• R has powerful graphics and data visualization capabilities.
• R has a large community of users and developers.


INFORMED POLITICAL DECISIONS AND BUSINESS VALUE FROM PUBLIC DATA

Politicians and local officials can also benefit from open data. The data that were locked away in departments' office desks have become available and accessible. Open data becomes the root of an information platform for viewing the city more holistically and making more informed decisions based on more information [5]. European public funds and projects would be better managed using information from public data. Public data could help the coordination and decision processes at high level, for policy makers, and also at the low level of individual projects' implementation. For example, an agricultural project or an infrastructure project would be better planned based on public data which at present is not really available. We do not consider that this is not feasible or that the data is not available at all, but simplicity of access, proximity and political openness are still only desiderates and not real certainties. In the near future, data for the public good tends to be driven by an eclectic community of media, nonprofits and academics focused on delivering information in different forms to the communities.

Public access to government data creates economic and business value and encourages entrepreneurship. Socio-economic census information, traffic patterns, and bus schedules are good data sources for applications and content development, but these do not make an open government [6]. There are weekly hackathons around the world that produce a number of useful tools online, including open data tools. To mention a few of these tools: Scraper Wiki, Google Refine, and mapping and format converters like IssueMa and Copypastemap. Open sources are often very useful for business purposes because they are easily accessible, inexpensive, quickly accessed and voluminous in availability. Marketers rely more and more on open sources of data in developing strategic plans and tactics [7]. GPS data and weather data usage are just two success cases we want to mention in this sense. Moreover, it is hard to imagine in the long run a major achievement with open data without business involvement and support from business partnerships.

A SIMPLE DATA REPORTING PIPELINE – THE FIRST APPROACH TO USE R WITH OPEN DATA

Starting to work with R for data analysis might be frustrating and difficult. There are many isolated tutorials on the web, but the heterogeneity and sparse distribution of thousands of R packages make the learning process challenging. A user of statistical packages tends to run a reduced set of procedures for a specific type of analysis. Such an analyst might wonder why they should learn the R language rather than using a package that provides friendly menus. The answer is still debatable. A statistical package is friendly, but often a pricey tool. The main downside of statistical packages is their 'black box' nature. With statistical packages, analysts can set up the analysis with all the parameters and options that they need; after they run the procedure, the resulting output may be long and verbose, and only later will they pull out the data needed. The main limitations of statistical packages come together with menus and embedded constraints and assumptions. For example, unlike in a typical statistical package, in R we can change the 'tol' argument of the qr() function. This argument controls whether the QR decomposition of a matrix will return a value or not for a column, depending on whether the column has been judged to be linearly dependent. The R paradigm is very different. With R a researcher has more freedom and flexibility, and also more responsibility. An analyst who is using R can go straight to the elements of interest, but at the same time should pay careful attention to the methodologies used and the statistical assumptions that come with the process.
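As a small, hedged illustration of the kind of low-level control mentioned above, the snippet below calls base R's qr() with the default and with a stricter 'tol' and compares the reported rank; the matrix is an arbitrary example with a nearly linearly dependent column.

set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)
X[, 4] <- X[, 1] + X[, 2] + 1e-9 * rnorm(5)   # a nearly linearly dependent column

qr(X)$rank                 # default tol = 1e-07: the near-dependent column is treated as redundant
qr(X, tol = 1e-12)$rank    # a stricter tolerance keeps it as an independent column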

PROGRAMMING PARADIGMS IN R

R borrows features from both functional programming languages (Lisp, Scheme) and object-oriented programming languages (C++):

printHello <- function(name) {
  print(paste("Hello, ", name))
}


R has a system for object orientation: S3 and S4 are built-in approaches for OO programming in R. There are still open debates related to the robustness of the OOP system in the R language. On the other hand, R is a strongly functional language. David Springate offers a performance benchmark example on his blog [8]:

# Get all even numbers up to 200000
# C style vector allocation:
x <- c()
for (i in 1:200000) {
  if (i %% 2 == 0) x <- c(x, i)
}
##    user  system elapsed
##    9.86    0.00    9.88

# FP style vectorised operation
a <- 1:200000
x <- a[a %% 2 == 0]
##    user  system elapsed
##    0.01    0.00    0.01

Regarding parallel programming, in R it is possible to do concurrent programming, for example running several functions at the same time. The snow, Rmpi, and pvm packages support these aspects across computers and also on multi-CPU or multi-core computers. Starting with R 2.14.0, the parallel package bundles parts of snow and multicore in the basic R distribution. Further on we present a brief data analysis process using R and an open data source from the European data portal. We will try to use the capabilities of the R programming environment to get the data, explore, model and communicate the results in a meaningful way.
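As a brief sketch of the bundled parallel package mentioned above (the task, data and object names here are made up for illustration), a socket cluster can run independent tasks side by side:

library(parallel)

# A toy task: one bootstrap replicate of the mean of a numeric vector (illustrative only)
boot_mean <- function(i, x) mean(sample(x, replace = TRUE))
x <- rnorm(10000)

cl  <- makeCluster(2)                          # start two worker processes
res <- parLapply(cl, 1:100, boot_mean, x = x)  # 100 bootstrap replicates split across workers
stopCluster(cl)

summary(unlist(res))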

SETTING-UP THE R WORKING ENVIRONMENT

The working process with R is highly interactive. The analyst runs a command for each desired granular output. For example, below we prepare the working environment with a few commands to clean the memory, set the working directory, and list its content.


rm(list=ls())
# set working directory
setwd("D:/WORK_2014/Articol_Revista_de_Statistica/RContent/DataAnalysisPipeLine")
list.files()   # list files from the directory

#### Example - Using R to collect archived data from a public data collection
#### This example regards the "Generation of waste by sector" data from the EU data portal

R has the main advantage of the community on CRAN, with over 2500 packages. Nothing will compare with this in the near future, not even commercial applications like SPSS, Matlab or SAS. R and its packages are written primarily in C and Fortran, although R is being extended through other languages. Here is the way we install and use packages:

# install.packages("R.utils")
# install.packages("ggplot2")
library("R.utils")   # instrumental package to work with archives
library("ggplot2")   # tremendous package for charts

GETTING THE DATA WITH R

R can get archived data (zip, tar.gz, etc.) and read widespread formats (csv, xls, tab, etc.). R can easily get data from APIs, in web data formats (json, xml, html, etc.). R is a good scripting language and it can take as input streaming data from big data storage systems like Hadoop. R has capabilities to interact with the entire spectrum of databases, from relational to NoSQL storage systems (Postgres, MongoDB, etc.).

# Collect and manage public data from URI:
# http://ec.europa.eu/eurostat/product?code=med_en22
temp <- tempfile()   # use the temporary environment to create a temporary file
download.file("http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?file=data/med_en22.tsv.gz", temp)
dataf <- gunzip(temp, "temp.tab")      # unzip the temporary file; R provides a full spectrum of file operations
dataf <- read.csv(dataf, sep = "\t")   # read the data from the structured file, tab-separated values in this case
unlink(temp)                           # remove the temp file via unlink()
if (file.exists("temp.tab")) file.remove("temp.tab")   # remove the temporary file


EXPLORING THE DATA WITH R

With a few lines of R code we can quickly generate descriptive statistics related to the working data.

# These four functions below give an overview of the file structure and content
dim(dataf); str(dataf); summary(dataf)
View(dataf)   # a quick view of the small data set

R has a very terse syntax. From the very beginning R was designed as a language specific for data processing. Its data structures, such as data.frame, matrix and list, make data crunching, manipulation and transformation very efficient.

# The data processing phase implies cleaning and manipulating only the data of interest
datafII <- dataf[36, -1]   # select the thousands of tonnes of municipal waste across time in Egypt

# gsub2 - an instrumental locally created function, used for the data cleaning step
# R is a functional programming language; functions can be considered objects in R;
# here we create a function to be used locally
gsub2 <- function(pattern, replacement, x, ...) {
  for (i in 1:length(pattern))
    x <- gsub(pattern[i], replacement[i], x, ...)
  x
}

# using gsub2
x <- gsub("X", "", names(datafII)); x
y <- as.numeric(gsub2(c(": ", " "), c("", ""), as.matrix(datafII))); y

TRANSFORM THE DATA USING R

In most data analysis exercises, preparing the data is more than half of the work. The analyst has to find where the data is, figure out how to access it, find the right records, and clean, filter and transform them before any statistical analysis can be done.


# Prepare the data structure for plotting
df <- as.data.frame(cbind(x, y))
names(df) <- c("Year", "Quantity")
# R has very fast selection, filtering and merging methods;
# the syntax is very terse and specific to functional languages
df <- df[!is.na(df[, c(2)]), ]

# These lines convert the factor levels into numeric values;
# here the specificity of the main data structure in R, the data.frame, is intentionally emphasized
df$Quantity <- as.numeric(levels(df$Quantity))[df$Quantity]
df$Year <- as.numeric(levels(df$Year))[df$Year]

# At any time we can interrogate the data to understand its status;
# this is one of the beauties of the R environment: it is interactive
head(df)
summary(df)

COMMUNICATE AND VISUALIZE THE RESULTS

With the tables and plots generated we can learn new things from the data and generate useful insights. There are several different graphics systems in R. The oldest one is base graphics, which is analogous to drawing on a canvas in successive phases. The lattice and ggplot2 packages provide functions for high-level plots based on grid graphics. Both base and grid graphics are device independent. Ggplot2 provides a unified framework and a set of options and modifiers present in base graphics. Moreover, it is hard to find any visualization method or data wrangling technique that is not already built into R.

# Load the package used to save data as MS Excel in a specific sheet
library(xlsx)
write.xlsx(x = df, file = "MunicipalWasteEgypt.xlsx",
           sheetName = "Quantity_by_Year", row.names = FALSE)

# Plot the data
p <- ggplot(df, aes(x = Year)) +
  geom_bar(data = df, aes(y = Quantity, fill = Quantity), stat = "identity") +
  scale_x_continuous(expand = c(0.1, 0)) +
  ylim(0, 25000) +
  ggtitle("Municipal waste cross time in Egypt (in 1000T)")
p


# Save the plot to be used later
jpeg("Municipal_waste_cross_time_in_Egypt.jpg")
p
dev.off()

The main results from the scripts are presented below. From the chart we can understand the evolution of municipal waste over time in Egypt. Such charts can support any presentation in a field where open data sources can be obtained.

Chart presenting the municipal waste over time in Egypt
Figure 2


Municipal waste over time in Egypt
Table 1

Year    Quantity
2010    21632
2009    20800
2008    20400
2007    20000
2006    16500
2005    19200
2004    18900
2003    18400
2002    17800
2001    17200
2000    16700

SOFTWARE APPLICATIONS WITH OPEN DATA – THE SECOND APPROACH TO USE R WITH OPEN DATA

R is not only for quantitative analysis; it is also used to construct desktop applications and web applications. The following part of this article presents R capabilities for creating data-driven applications with open data and the R language. The R-shiny package allows R programmers to transform their analyses, without much effort, into interactive web applications accessible to everyone in a browser. The shiny package has embedded prerequisites to build a web application without knowledge of CSS or JavaScript technologies. An R-shiny application allows building full reporting applications containing controls, sliders, plots, tables and summaries. It is designed to work on a local port, but it also has a server version. Shiny is a micro-framework which can help statisticians learn the fundamentals of web development. R-shiny is not the only web framework in the R portfolio. There are other packages like Rook, which is a web server interface and an R package at the same time. Rook applications are usually combined with another product, rApache, from the same author, Jeffrey Horner. This is a framework supporting web application development using the R statistical language and environment and the Apache web server. An R application can use data sources in two main forms: 1) from an API; 2) from a data store, database or data source files in repositories.


Diagram presenting how an R web application might be sourced with open data
Figure 3

Shiny applications generally contain two main R scripts, kept within the same folder. They should be named server.R and ui.R. Besides these two, complementary R scripts can be used to build up the application further, e.g. global.R contains code that has to be run at initiation and is used by the entire application. With source("codeScript.R") we can bring into the application any other R functionality that we want to have available. An example of a data reporting web application on the micro-framework R-Shiny follows.

File: ui.R

library(shiny)

# Define UI for random distribution application
shinyUI(pageWithSidebar(

  # Application title
  headerPanel("Auto Park - Reporting with Public Data"),

  # Sidebar with controls to select the city and the detail related with the auto park
  sidebarPanel(
    wellPanel(
      radioButtons("dist", "Report on:",
                   list("Cars by Cities" = "cbc",
                        "Zoom In" = "parc"))
    ),
    conditionalPanel(condition = "input.dist == 'cbc'",
      wellPanel(
        h4(p(strong("Select the Cities"))),
        selectInput("variableJD1", "City:", jdls),
        selectInput("variableJD2", "City:", jdls),
        selectInput("variableJD3", "City:", jdls),
        selectInput("variableJD4", "City:", jdls)
      )
    ),
    conditionalPanel(condition = "input.dist == 'parc'",
      wellPanel(
        h4(p(strong("Select the Detail"))),
        selectInput("variableEL1", "Element:", elem)
      )
    )
  ),  # ---- close sidebar panel

  # Show a tabset that includes a plot and two table views
  mainPanel(
    conditionalPanel(condition = "input.dist == 'cbc'",
      h4(p(strong("Auto Endowment by City"))),
      tabsetPanel(
        tabPanel("City Auto Park",
                 plotOutput("plot", width = "1000px", height = "600px")),
        tabPanel("Table Report", tableOutput("table1"))
      )
    ),
    conditionalPanel(condition = "input.dist == 'parc'",
      h4(p(strong("Zoom Into Data"))),
      tabsetPanel(
        tabPanel("Zoom Into Auto Park", tableOutput("table2"))
      )
    )
  )
))


File: server.R

library(shiny)

# Define server logic for random distribution application
shinyServer(function(input, output) {

  output$plot <- renderPlot({
    ls1 <- list(input$variableJD1, input$variableJD2, input$variableJD3, input$variableJD4)
    p <- plotCars(dataf = data1, cityList = ls1)
    print(p)
  })

  # Generate a summary of the data
  output$table1 <- renderTable({
    ls2 <- list(input$variableJD1, input$variableJD2, input$variableJD3, input$variableJD4)
    dataCarsII(dataf = data1, cityList = ls2)
  })

  # Generate an HTML table view of the data
  output$table2 <- renderTable({
    ls <- list(input$variableJD1, input$variableJD2, input$variableJD3, input$variableJD4)
    tableElems(dataf = data1, param = input$variableEL1, cityList = ls)
  })

})

File: global.R

rm(list=ls())
library("ggplot2")
library("data.table")
library("reshape")
library("RColorBrewer")

############# Part II #########################

data1 <- read.csv("parcautoTopAll.csv")
long <- dim(data1)[1]; long; head(data1)
data1$Serial <- seq(long)
head(data1)

## This method creates the table for park comparisons cross cities
## It is constructed using ggplot2
dataCars <- function(dataf = data1, cityList = list("B", "CJ")) {
  # dataf = data1
  options(warn = -1)
  DT <- data.table(dataf, key = c("Serial"))
  dplt <- as.data.frame(DT[, sum(Numar), by = list(PARC.AUTO.2013, Judet)][Judet %in% cityList])
  names(dplt)[3] <- c("Numar")
  dplt <- dplt[dplt$PARC.AUTO.2013 != "TOTAL", ]
  dplt <- as.data.frame(dplt)
  options(warn = 0)
  return(dplt)
}

dataCars()

dataCarsII <- function(dataf = data1, cityList = list("CJ", "B")) {
  options(warn = -1)
  df <- dataCars(dataf, cityList)
  names(df)[3] <- "value"
  df <- cast(df, PARC.AUTO.2013 ~ Judet, sum)
  return(df)
}

dataCarsII(dataf = data1, cityList = list("CJ", "B", "DJ"))

## This method creates the chart for park comparisons cross cities
## It is constructed using ggplot2
plotCars <- function(dataf = data1, cityList = list("CJ", "B")) {
  dplt <- dataCars(dataf, cityList)
  colourCount = length(unique(dplt$PARC.AUTO.2013))
  getPalette = colorRampPalette(brewer.pal(9, "Paired"))
  p <- ggplot(data = dplt, aes(x = PARC.AUTO.2013, y = Numar))
  p <- p + geom_bar(aes(fill = PARC.AUTO.2013), stat = "identity")
  p <- p + scale_fill_manual(values = getPalette(colourCount))
  p <- p + facet_grid(Judet ~ .)
  p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 1))
  p
}

## This method creates the table for characteristics of park components cross cities
tableElems <- function(dataf = data1, param = "Carburant_Benzina", cityList = list("B", "CJ")) {
  options(warn = -1)
  DT <- data.table(dataf, key = c("Serial"))
  df <- DT[Judet %in% cityList][, sum(noquote(Carburant_Benzina), na.rm = TRUE), by = Judet]
  names(df)[2] <- as.character(param)
  options(warn = 0)
  df <- as.data.frame(df)
  return(df)
}


jdls <- list("Total" = "TOTAL", "Alba" = "AB", "Arges" = "AG", "Arad" = "AR", "Bucuresti" = "B",
             "Bacau" = "BC", "Bihor" = "BH", "Bistrita-Nasaud" = "BN", "Braila" = "BR",
             "Botosani" = "BT", "Brasov" = "BV", "Buzau" = "BZ", "Cluj" = "CJ", "Calarsi" = "CL",
             "Caras-Severin" = "CS", "Constanta" = "CT", "Covasna" = "CV", "Dambovita" = "DB",
             "Dolj" = "DJ", "Gorj" = "GJ", "Galati" = "GL", "Giurgiu" = "GR", "Hunedoara" = "HD",
             "Harghita" = "HR", "Ilfov" = "IF", "Ialomita" = "IL", "Iasi" = "IS", "Mehedinti" = "MH",
             "Maramures" = "MM", "Mures" = "MS", "Neamt" = "NT", "Olt" = "OT", "Prahova" = "PH",
             "Sibiu" = "SB", "Salaj" = "SJ", "Satu-Mare" = "SM", "Suceava" = "SV", "Teloerman" = "TL",
             "Timisoara" = "TM", "Targoviste" = "TR", "Valcea" = "VL", "Vrancea" = "VN", "Vaslui" = "VS")

elem <- c("Numar" = "Numar", "Vechime_0_2" = "Vechime_0_2", "Vechime_3_5" = "Vechime_3_5",
          "Vechime_6_10" = "Vechime_6_10", "Vechime_11_15" = "Vechime_11_15",
          "Vechime_16_20" = "Vechime_16_20", "Vechime_20plus" = "Vechime_20plus",
          "Carburant_Motorina" = "Carburant_Motorina", "Carburant_Benzina" = "Carburant_Benzina",
          "Vechime_0_4" = "Vechime_0_4", "Vechime_5_8" = "Vechime_5_8",
          "Vechime_9_12" = "Vechime_9_12", "Vechime_peste_12" = "Vechime_peste_12")

To run the application we use the following commands:

library(shiny)
setwd("/home/ubuntu/openapp")
runApp("/home/ubuntu/openapp/rapp", port = 8101)

The application is running here: http://ec2-54-229-96-217.eu-west-1.compute.amazonaws.com:8101/. Below, a print screen of the main page presents the data chart about car endowment in the Romanian Auto Park. Such an application can ease the learning process about socio-economic or political realities using public data.


Print screen of the main page
Figure 4

CONCLUSIONS

Public data has a great potential to generate value for the public good. Not only for experienced statisticians and analysts, but also for students, journalists and researchers in general, the road from open data to valuable information and insights can be travelled faster by using the R programming language. R can help not only statisticians but any researcher who has the courage to try to solve problems with it. With respect to open data, R can come in handy at any analytical step; moreover, with R one can build standalone software applications that help extract value and wisdom from public data.




Data Editing and Imputation in Business Surveys Using "R"

Elena ROMASCANU ([email protected])
National Institute of Statistics, Romania

ABSTRACT

Purpose – Missing data are a recurring problem that can cause bias or lead to inefficient analyses. The objective of this paper is a direct comparison between two statistical software tools, R and SPSS, in order to take full advantage of the existing automated methods for the data editing process and imputation in business surveys (with a proper design of consistency rules) as a partial alternative to the manual editing of data. Approach – The comparison of different methods of editing survey data in R with the 'editrules' and 'survey' packages (because they contain transformations commonly used in official statistics), the visualization of missing value patterns using the 'Amelia' and 'VIM' packages, imputation approaches for longitudinal data using 'VIMGUI', and a comparison with the performance of another statistical software package, SPSS, on the same features. Findings – Data on business statistics received by the NIS (National Institute of Statistics) are not ready to be used for direct analysis due to in-record inconsistencies, errors and missing values in the collected data sets. The appropriate automatic methods from R packages offer the ability to locate the erroneous fields in edit-violating records and to verify the results after the imputation of missing values, providing users with a flexible, less time-consuming approach; automation is easier to perform in R than with SPSS macro syntax, even in situations where macros are very handy. Keywords: Business Surveys, Automated Edit Rules, Missing Values, Pattern of Missing, Random vs. Systematic Errors, Multiple Imputation, Non-Response Weights, Statistical software R, SPSS, SQL

INTRODUCTION

This paper is concerned only with an essential aspect of the post-data-capturing stage of business surveys: the treatment of numerical data under linear constraints using computationally intensive techniques. In order to build the editing strategy and to provide high-quality statistical information, the methods discussed in this paper can be considered appropriate for identifying random errors. The treatment of errors should therefore be done according to their origin, random or non-random (systematic errors), and the non-random ones should be treated first, before applying any automatic method.


Improving editing techniques for business surveys means making them less costly than has traditionally been the case, while maintaining accuracy. Business surveys can be classified into two broad categories: those producing short-term statistics and those focusing on structural statistics. Throughout this paper, the term "survey" will refer to a sample survey. The sampling frame is created from the Romanian Business Register (REGIS), which contains all enterprises, authorities and organizations, as well as their local units, that carried out any economic activity, together with their size and whether they belong to the private or the public sector. Registers containing detailed legal unit data records on a business population are used, but they cannot always deliver, even after maintenance or updating, all the specific information required. Surveys are usually designed to obtain information directly from businesses and are widely used by statistical institutes due to their flexibility in asking specific questions. The data on business statistics received by the NIS are not ready to be used for direct analysis due to in-record inconsistencies, errors and missing values in the collected data sets. To produce statistical output, these problems have to be treated using error detection, correction and imputation. Edit and imputation (E&I) is known as one of the most important aspects of business surveys, but it is a very time-consuming process for NSIs. The process of dealing with data cleaning methods has become a strength in finding best practices and having micro or macro databases ready to be analysed by different data users.

METHODOLOGY

The intention is to find, explore and use a proper and easy way to deal with these methods, thus reducing the time spent on validation at the expense of other important phases of surveys that in turn require error checking. Since data from surveys often contain errors, it is desirable to detect them. To exemplify this, two files from a business survey sample were used in order to demonstrate some of the functionalities of the R statistical software with simple examples. The main advantages are increasing the efficiency of the editing processes and making use of existing automated methods (with a proper design of consistency rules) as a partial alternative to the manual editing of data. In statistical offices these methods and tools are used at the post-data-receiving stage, for identifying and eliminating errors that could otherwise affect the collected data. Modern approaches to editing statistical data, especially for business surveys, can reduce the potential for bias arising from influential or non-influential errors. Editing has a major role in the data cleaning process, but its most useful role derives from its ability to provide information about the survey process, quality measures and improvements for future surveys.


Another consideration worth taking into account is the resource- and time-consuming nature of the process: it has been estimated that National Statistical Institutes spend around 40% of their resources on editing and imputing data (De Waal et al., 2011). For efficiency reasons, it may be desirable to edit at least part of a data file by automatic methods (see "MEMOBUST Handbook, Statistical Data Editing – Main Module"). It is recognized that fatal errors (e.g., invalid or inconsistent entries) should be removed from the data sets in order to maintain accuracy and to facilitate further automated data processing and analysis. The goal of automatic editing is to accurately detect and treat errors and missing values in a fully automated manner, without human intervention. A recent development in NSIs is the increasing use of administrative data sources, as opposed to the more traditional data collection by sample surveys, as an approach to reducing response burden. It is known that not all data need to be corrected, i.e. not all data containing errors need to be corrected down to the smallest detail (over-editing). Studies prove (Granquist and Kovar, 1997) that there is no need to eliminate all errors in the data set to obtain reliable published figures. The main products of statistical offices are tables containing aggregated data, so small errors in the individual records are acceptable and tend to cancel out when aggregated. On the other hand, influential errors and other relevant errors, such as unit-of-measure errors and other systematic errors, generally have a high impact on published figures. The traditional approach of editing, detecting and correcting errors in the collected data is very labour-intensive and time-consuming, with a degree of inefficiency, because measurement error is not the only source of error in statistical output. Generally, there are major differences in choosing the proper technique depending on the kind of data: numerical or categorical. Many national statistical institutes (NSIs) nowadays use automatic editing. Most automatic editing methods treat a record of data in two steps: first, an attempt is made to identify the variables with erroneous or missing values (the error localisation problem) and second, new values are imputed to obtain a valid record.

RESULTS

This material aims to show that the packages 'editrules', 'VIMGUI', 'survey' and 'Amelia' from the R Project for statistical computing, used in the editing and imputation process, perform well in localisation problems. The R Project has many packages implementing various functions to handle missing values and missing value imputation


(note: this is only a partial list): 'Amelia', 'arrayImpute', 'bcv', 'cat', 'crank', 'CVThresh', 'compositions', 'Design', 'dprep', 'eigenmodel', 'EMV', 'FAwR', 'Hmisc', 'imputeMDR', 'MADAM', 'mclust', 'Mfuzz', 'mi', 'mitools', 'mice', 'missMDA', 'mimR', 'mix', 'MImix', 'MIfuns', 'monomvn', 'mvnmle', 'norm', 'nnc', 'optmatch', 'pan', 'pcaMethods', 'prabclus', 'rama', 'randomForest', 'rconifers', 'relaimpo', 'robCompositions', 'rrp', 'scrime', 'SDisc', 'simsalabim', 'VIM', 'vmv', 'yaImpute'. The content of these packages is useful and may solve specific, clearly identified situations in business surveys, in contrast with the much-used SPSS, with which one can test and determine the pattern of missing data; even SPSS 22 (the latest version) provides poor options for handling missing data: although it offers Little's MCAR test as a measurement tool for missingness, it does not offer parallel box plots or scatter plot matrices with information about missing/imputed values of the kind one can produce in VIM. Although SPSS performs better in some respects and provides some good imputation methods, including stochastic regression and EM imputation, there are voices considering that the SPSS missing value analysis has been biased and limited in the types of imputations; this situation has improved over the last five versions. Moreover, the latest version of SPSS bridges to R through the R Integration Package for SPSS Statistics, which provides the ability to use R programming features within SPSS Statistics. This feature requires the SPSS integration plug-in for R, installed with SPSS Statistics - Essentials for R (see SPSS 22; tutorials are available by choosing Help - Working with R). With these tools one has everything needed to create custom procedures in R. In addition, in SPSS the quality of imputation cannot be visually explored using the various univariate, bivariate, multiple and multivariate plot methods that the 'VIM' or 'mice' R packages provide; one has to return to the general descriptive statistics submenu and then check plots of the means and standard deviations by iteration and imputation for each scale dependent variable for which values are imputed. Another example is the identification of null values and missing values using relational databases, which is easy for experienced NSI staff but somewhat time consuming. T-SQL creates an object called a rule that specifies the acceptable values that can be inserted into a column. A rule can be any expression valid in a WHERE clause and can include elements such as arithmetic operators, relational operators and predicates (for example, IN, LIKE, BETWEEN); but when thinking about imputation using SQL, it is preferable not to continue the process of imputing missing values there. Even if there is the opportunity to run this process in a DBMS, not by replacing missing values with means or medians but by building a k-means clustering


model based on them, it may be desirable or even necessary to perform the statistical analysis in a statistical package rather than in the database. On the other hand, the R packages 'sqldf' and 'DBI' speak the language of relational databases well, helping to achieve the same goals as SQL, so we again find an overlap between the R environment and other frequently used software packages. In R, a record of data can be represented as a vector of fields or variables, each with its domain. Examples of variables and domains are the size class with domain (small, medium, large), the number of employees with domain (0, 1, 2, …, n), and profit with domain (0, ∞). Edit rules are derived from conditions that should be satisfied by the values of single variables or combinations of variables in records. For the purpose of automatic editing, all edit rules must be checkable per record, and may therefore not depend on values in fields of other records. Examples of edit rules are given below: annual turnover ≥ 0 (should be non-negative); profit = turnover – total costs; IF (size class = "medium") THEN (10 ≤ number of employees < 49), for mixed data containing both character and numerical fields; NACE code check for validity, WHERE IN (SELECT code FROM Nomenclature).
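These example rules can be expressed directly in R with the 'editrules' package discussed in the next section. A minimal sketch, in which the variable names turnover, total_costs, size and n_employees are hypothetical, not the survey's actual field names:

library(editrules)

# numeric, balance and conditional edits corresponding to the examples above
E <- editset(c(
  "turnover >= 0",                              # non-negativity
  "profit == turnover - total_costs",           # balance edit
  "if (size == 'medium') n_employees >= 10",    # conditional (mixed) edit
  "if (size == 'medium') n_employees < 49"
))
E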

'EDITRULES': A PACKAGE FOR PARSING, APPLYING AND MANIPULATING DATA CLEANING RULES

A good way to see and test errors in an automatic way is to install the R package 'editrules', a useful tool for detecting errors that can be expressed as rules based on linear equations, restrictions in the well-known if-else form, and conditional restrictions on numerical or character data. Automatic editing means that data are checked and adjusted by computer. The rules can be written and defined in a .txt file.

> rules1 <- editfile("edit.txt")
> ver1 <- violatedEdits(rules1, file)  # indicates which records violate the rules defined by yourself, for the original file used
> plot(ver1)
> summary(ver1)
Edit violations, 650 observations, 0 completely missing (0%):

- violatedEdits() returns NA when edits cannot be checked because of missing values in the data.
- rules1 – an object containing the constraints, of class 'editset', 'editmatrix' or 'editarray'.


editname  freq   rel
num3        41  6.3%
num2        39    6%
num1        16  2.5%

Edit violations per record:
errors  freq    rel
0        570  87.7%
1         65    10%
2         14   2.2%
3          1   0.2%

Figure 1: Graphic of the edit rule violations

As for further localization, as stated by the authors (Van der Loo, 2011), the 'searchBest' method gives the lowest-weight solution to the error localization problem. Apart from 'searchBest', there are other solvers in the backtracking object returned by 'errorLocalizer', namely:
- 'searchNext' – searches the next solution in the binary tree;
- 'searchAll' – returns all solutions encountered while traversing the binary tree in a branch-and-bound manner;
- 'searchBest' – returns a random lowest-weight solution if multiple are found;


- 'reset' – resets the backtracker object to its initial state.

In practice, the automatic editing introduced by this package is based on the Fellegi-Holt paradigm (see Fellegi and Holt, 1976), which requires that the smallest (weighted) number of fields is changed so that the record can be imputed consistently. In fact, the Fellegi-Holt method only provides a list of variables ready for the imputation process, in order to obtain clean data based on the edit rules; it does not provide the final value to impute. Another level, strictly based on accurate imputation rules, is needed. The list of edit violations seen above produces a list of fields and observations that violate the imposed editing rules. Edit rules (also called validity rules) impose conditions that should be satisfied by the values of single variables or combinations of variables in a record. Besides systematic errors, data also contain non-systematic, random errors that are not caused by a systematic reason but occur randomly; an example is typing too many digits. To identify such non-systematic errors, the Fellegi-Holt paradigm is suitable and recommended (De Waal, 2003), because the data in a record should be made to satisfy the specified edits by changing the fewest possible (weighted) number of fields. To each variable a non-negative weight, the so-called reliability weight, is assigned, indicating the reliability of the values of this variable. The higher the weight of a variable, the more reliable the corresponding values are considered to be. If all weights are equal, the generalized Fellegi-Holt paradigm reduces to the original Fellegi-Holt paradigm. This method works well for records that contain few errors. Given such a minimal index set, De Waal (2003) constructs the implied edit, given by:

IF v_r ∈ D_r and v_i ∈ ∩_{j∈S} F_j^i for i = 1, …, r-1, r+1, …, m THEN (x_1, …, x_n) ∈ ∅

Some records are not suitable for automatic editing; when a record contains many errors, based on a predefined maximum number of errors, it will not be introduced into the automated editing process but will be considered for another method, such as reweighting (see the 'survey' R package). Rules can be defined with common R syntax, parsed to an internal (matrix-like) format and manipulated with variable elimination and value substitution methods, allowing for feasibility checks. Data can be tested against the rules and erroneous fields can be found based on Fellegi and Holt's generalized principle.
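A minimal sketch of error localisation with 'editrules' (the data frame dat and the unit reliability weights are hypothetical; localizeErrors() is the convenience wrapper around 'errorLocalizer'):

# localize the (weighted) minimal set of fields to change in each record
el <- localizeErrors(rules1, dat, weight = rep(1, ncol(dat)))

head(el$adapt)   # logical matrix: TRUE marks the fields proposed for imputation
el$status        # per-record solution status and weights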


The discussion emphasizing the preference for, or the need of, a specific software capability for enterprise statistical surveys conducted in statistical offices exists because our objective is the accuracy and rigour of inferential analysis, built as well as possible while avoiding generalization errors. Diagnostic procedures yield information about the nature of missing data and potential biases due to missing data. We need this evaluation since both numerical and graphical diagnostic procedures provide information by which to better handle, diagnose and interpret missing data and their impact on study results.

‘AMELIA II’: A PACKAGE FOR MISSING DATA

'Amelia II' "multiply imputes" missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country) or from a time-series cross-sectional data set (such as data collected by years for each of several countries). 'Amelia II' implements a bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EM approaches, is usually considerably faster than existing approaches and can handle many more variables.

> install.packages("Amelia", repos = "http://r.iq.harvard.edu", type = "source")
> library(Amelia)
> AmeliaView()

The imputation model in 'Amelia II' assumes that the complete data (that is, both observed and unobserved) are multivariate normal. Amelia requires both the multivariate normality and the MAR assumption (or the simpler special case of MCAR). When using multiple imputation, the main idea is to identify the variables to be included in the imputation model. At least as much information as will be used in the analysis model needs to be included. This means that any variable present in the analysis model should also be in the imputation model, including, of course, any transformations or interactions of variables that will appear in the analysis model.
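A minimal sketch of a programmatic call, as an alternative to the AmeliaView() interface. The data frame bsdata is hypothetical; the variable names nm_as11 (number of employees), ca_as11 (annual turnover) and c_caen (activity code) are those used later in the VIM example, and m = 5 imputed data sets is an assumed choice:

library(Amelia)

# multiply impute the numerical items; idvars keeps the activity code out of the imputation model
a.out <- amelia(bsdata, m = 5, idvars = "c_caen")

summary(a.out)
missmap(a.out)                  # missingness map, as in Figure 2
plot(a.out, which.vars = 1:2)   # diagnostics: observed vs. imputed densities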


Figure 2: Missingness map

Missing values are in tan and observed values are in red. The missingness map is an important tool for understanding the patterns of missingness in the data and indicates different ways to improve the imputation model. The variables considered are the number of employees (as an auxiliary variable) and the annual turnover of enterprises. The correlation between these variables is important and could also be a feasible solution for small area estimation in structural surveys.


Figure 3: Example of a diagnostic graph

The color of the line (as coded in the legend) represents the fraction of missing observations in the pattern of missing values for that observation. For each observation, Amelia also plots 90% confidence intervals that allow the user to visually inspect the behavior of the imputation model. By checking how many of the confidence intervals cover the y = x line, one can tell how often the imputation model can confidently predict the true value of the observation. A typical scenario for a business survey is that data for relatively small businesses with simple structures are taken or derived from tax returns, whereas surveys are used to collect data from the key units (usually those that are largest and/or have the most complex structures).

'VIM': A PACKAGE FOR VISUALIZATION AND IMPUTATION OF MISSING VALUES – 'VIMGUI()'

This package introduces new tools for the visualization of missing and/or imputed values, which can be used for exploring the data and the structure of the missing and/or imputed values. Depending on this structure of the missing values, the corresponding methods may help to identify the mechanism generating the missing values and allow exploring the data, including the missing values. In addition, the quality of imputation can be visually explored using various multiple plot methods. A graphical user interface allows easy handling of the implemented plot methods (cran.r-project.org).


Figure 4: The VIM GUI and its menu for importing data

> activedataset <- spss.get("C:/FILE.SAV", use.value.labels = FALSE, lowernames = TRUE,
                            force.single = TRUE, charfactor = TRUE, to.data.frame = TRUE)
> originaldataset <- activedataset
> matrixplot(activedataset, sortby = "c_caen")

Figure 5: Matrix plot of missing values


Using the function 'matrixplot' one can create a colour matrix plot in which the data cells are represented by coloured rectangles. The data matrix plot can also be sorted by clicking inside the plot space on the column of the variable one wants to sort by. This is an example of a missing at random (MAR) pattern. From the Imputation menu one can choose between several imputation methods for missing values, such as kNN, hot-deck and IRMI. The GUI has two menus for graphical methods: "Visualization", created for the analysis of missing values before imputation, and "Diagnostics", designed to inspect the outcome after the imputation process.

Variables sorted by number of missings:
Variable  Count
ca_as11      13  (annual turnover)
nm_as11      10  (number of employees)
c_caen        0
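A complementary overview of the missingness structure can be produced with VIM's aggregation plot; a minimal sketch on the same (hypothetical) active data set:

library(VIM)

# barplot of missing counts per variable plus the combinations of missing cells
aggr(activedataset, numbers = TRUE, prop = FALSE)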

For exemplification, and due to incomplete item values, we make use of the hot-deck method – nearest neighbour imputation, used to compensate for non-response in sample surveys. The k-nearest neighbour imputation, based on a variation of the Gower distance for numerical, categorical, ordered and semi-continuous variables, has the following usage:

> activedataset <- kNN(activedataset, variable = c("nm_as11"), k = 5, dist_var = c("ca_as11"),
                       weights = NULL, numFun = median, catFun = maxCat, impNA = TRUE,
                       addRandom = FALSE, mixed = NULL, mixed.constant = NULL)


Figure 6: Scatterplot matrix of the variables; imputed values in nm_as11 are highlighted

Statistical surveys tend to suffer from varying degrees of non-response, which affects the efficiency of the sampling process and the quality of the resulting statistics. Non-response typically takes one of two forms: "unit non-response", in which no data are supplied for the unit concerned, or "item non-response", in which a partial return is provided but some data items are blank. A more convenient alternative may be to decide that if data are not provided by a particular date and the individual units are not vital to the survey results (e.g. smaller businesses in a business survey), the values are instead taken from administrative sources. Relying on a strong correlation between the administrative data and the survey data, the survey data can be either replaced directly with administrative data or indirectly through the production of modelled values based on the relationship between the two sets of data. As such, more and more business survey estimates are based on a combination of survey and administrative information. NSIs are increasingly turning to the use of administrative data to reduce the cost of surveys and to reduce the burden on respondents.


The basic way of estimating is summing weighted variable values for the units that happened to be in the sample. Assuming a 100% response rate, this gives an unbiased estimator: the Horvitz-Thompson (abbreviated H-T) estimator of the population total. A more advanced point estimator is the generalized regression (GREG) estimator. For surveys, it is important to know if, and how, non-respondents differ from respondents; this is imperative for knowing whether we are making correct inferences from sample data. There are some main elements that characterize business survey data (Granquist, 1995). Firstly, responses to items of interest often present highly skewed distributions; in other words, a small number of units substantially contribute to the total estimate. Furthermore, information on the surveyed businesses is often available from a previous survey or can be drawn from administrative sources. As collected, micro-data often include implausible or impossible values, for example arising from multiple forms of survey error (Groves, 1989), such as reporting and measurement error. NSIs prefer not to release such faulty values and so undertake a process usually referred to as "edit and imputation" (De Waal et al., 2011). Editing and imputation is a set of activities for detecting erroneous and missing data and treating these data accordingly. When there are incidences of missing values in quantitative data items, such as sales and fixed investment, the current practice is to compile the survey result by imputing the missing value with the previous fiscal year's value obtained from the non-responding enterprise. Like household surveys, business surveys often use one of the following methods to account for non-response: follow-up, imputation, or weighting adjustments. Imputation is done at the unit or item level and is the process of creating non-missing values by inferring from other data what a missing value "should" be (Singh and Petroni, 1988). Unit non-response is usually treated by weighting the responding cases accordingly. In some applications even unit non-response is treated with unit imputation, meaning that a missing unit is replaced by another unit close to the first one in a metric way, using the nearest neighbour technique; but reweighting is perhaps the most common method, because it is an approach that can be used to tackle bias resulting from non-response. The main intention of reweighting the data is to adjust the original inclusion probabilities by the response probabilities. When a stratified sampling survey is conducted with imperfect response, it is desirable to rescale the sampling weights to account for the non-response. Edit and imputation (E&I) is known as one of the most important aspects of business surveys, but it is a very time-consuming process for NSIs.
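To make the H-T and GREG estimators mentioned above concrete, a minimal sketch with the 'survey' package; the data frame, the strata and weight variables and the auxiliary population totals are hypothetical, not taken from the paper's survey:

library(survey)

# stratified design with the design (inclusion) weights
des <- svydesign(ids = ~1, strata = ~size_class, weights = ~dweight, data = bsample)

# Horvitz-Thompson estimator of the population total of turnover
svytotal(~turnover, des)

# GREG-type estimator: calibrate the weights to an assumed register total of employees
des_greg <- calibrate(des, formula = ~n_employees,
                      population = c(`(Intercept)` = 50000, n_employees = 1200000))
svytotal(~turnover, des_greg)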


The process of dealing with data cleaning methods aims to strengthen and establish best practices for having micro or macro databases ready to be analysed by different data users. Some forms of imputation are known as logical or deductive imputation, mostly when dealing with item non-response, as opposed to the previously discussed unit non-response. All business surveys suffer from the effects of non-response. The reasons why this happens are well known, but the important fact is that in these surveys the non-response is rarely Missing Completely at Random (MCAR). Systematic non-response patterns (MNAR) are responsible for biases in survey estimates and impose the use of weighting methods. Dealing with these aspects is not easy because, on the one hand, the agreed methods refer to weighting and imputation but, on the other hand, in the presence of a systematic missingness pattern common imputation is not always appropriate. The assumption about randomness (MCAR, MAR, MNAR) must always be evaluated. Methods such as ratio, regression and nearest neighbour imputation are appropriate for business surveys where one can use other sources, preferably longitudinal ones, in order to reduce non-response bias. Perhaps the most valuable R package under the MNAR assumption is 'mice' (Multivariate Imputation by Chained Equations); it can handle both MAR and Missing Not at Random (MNAR) data. Multiple imputation under MNAR requires additional modelling assumptions. The default methods of the 'mice' package are partially exposed in the package 'MissingDataGUI', a GUI for missing data exploration.
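A minimal sketch of multiple imputation by chained equations with 'mice'; the data frame bsdata, the choice of m = 5 imputations and the predictive mean matching method are assumptions made only for illustration:

library(mice)

# five imputed data sets using predictive mean matching for the numeric items
imp <- mice(bsdata, m = 5, method = "pmm", seed = 2014)

summary(imp)
completed1 <- complete(imp, 1)   # first completed data set
stripplot(imp)                   # observed vs. imputed values per variable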

‘SURVEY’: A PACKAGE FOR ANALYSIS OF COMPLEX SURVEY SAMPLES

For non-response one can also take into account the R package 'survey'. 'nonresponse()' combines stratified tables of population size, sample size and sample weight. 'sparseCells()' identifies cells that need combining.

> sparseCells(nr)
Cells: 3 5 7 11
Indices:
   strVar1 strtVar2 strtVar3
3  "No"    "Yes"    "E"
5  "No"    "No"     "H"
7  "No"    "Yes"    "H"
11 "No"    "Yes"    "M"


Summary:
     NR wt    wt    n
3      Inf   Inf    0
5      3.2   108    3
7      Inf   Inf    0
11     Inf   Inf    0

'neighbours' – describes the cells adjacent to one specified cell
> neighbours(3, nr)                   # look at the neighbours of cell 3
> nonresponse(object, nbour.index)    # create a nonresponse object

Cells: 4 7 1
Indices:
  strVar1 strtVar2 strtVar3
4 "Yes"   "Yes"    "E"
7 "No"    "Yes"    "H"
1 "No"    "No"     "E"
Summary:
     NR wt    wt     n
4     0.92   31.1   112
7      Inf    Inf     0
1     1.04   35.2    12

Function 'joinCells':
> joinCells(nr1, 3, 11, 8)   # collapse some contiguous cells
> nonresponse(sample.weight, sample.count, table)
12 original cells, 8 distinct cells remaining
Joins:
3 5 7
3 5 7 8 11
     counts          NRweights        totalwts
 Min.   :  3.00   Min.   :0.6840   Min.   :23.15
 1st Qu.:  7.00   1st Qu.:0.8956   1st Qu.:30.31
 Median : 11.00   Median :0.9793   Median :33.15
 Mean   : 22.88   Mean   :1.1461   Mean   :38.79
 3rd Qu.: 15.50   3rd Qu.:1.3142   3rd Qu.:44.48
 Max.   :112.00   Max.   :2.0977   Max.   :71.00


When the collapsing is complete, use ‘weights()’ to extract the non-response weights.
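For instance (nr1 being the collapsed nonresponse object from the example above):

# extract the non-response adjustment weights after collapsing the sparse cells
nrweights <- weights(nr1)
nrweights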

CONCLUSIONS

In this paper, several recent approaches to missing data methods for identifying missing values in data sets were tested. R has data structures, for example for edit rules, that allow one to think more about the statistics performed than about the internal representation of the data; on the other hand, automation is easier to perform in R than in SPSS, with attention to the validity, reliability and power of the study. The techniques and functionality discussed in this article represent a very small percentage of the available methods for identifying, displaying and imputing missing values.

REFERENCES:
1. De Waal, T. (2000), An Optimality Proof of Statistics Netherlands' New Algorithm for Automatic Editing of Mixed Data. Report, Statistics Netherlands, Voorburg.
2. De Waal, T. and Quere, R. (2003), A Fast and Simple Algorithm for Automatic Editing of Mixed Data. Submitted to Journal of Official Statistics.
3. De Waal, T., Pannekoek, J. and Scholtus, S. (2011), Handbook of Statistical Data Editing and Imputation, Wiley.
4. Fellegi, I.P. and Holt, D. (1976), A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association 71, pp. 17-35.
5. Granquist, L. and Kovar, J.G. (1997), Editing of survey data: how much is enough? In Survey Measurement and Process Quality, Lyberg et al. (eds.), Wiley, New York, pp. 415-435.
6. Groves, R.M. (1989), Survey Errors and Survey Costs, New York: Wiley.
7. Honaker, J. and King, G. (2010), What to do About Missing Values in Time Series Cross-Section Data. American Journal of Political Science 54, no. 3, pp. 561-581.
8. Templ, M., Alfons, A., Kowarik, A. and Prantner, B. (2011), VIM: Visualization and Imputation of Missing Values. URL: http://CRAN.R-project.org/package=VIM.
9. Memobust, Handbook on Methodology of Modern Business Statistics (2014), www.cros-portal.eu.
10. Scholtus, S. (2009), Automatic correction of simple typing errors in numerical data with balance edits. Technical Report 09046, Statistics Netherlands, Den Haag.
11. Singh, R. and Petroni, R. (1988), Nonresponse Adjustment Methods for Demographic Surveys at the U.S. Bureau of the Census.
12. Van der Loo, M., De Jonge, E. and Scholtus, S. (2011), Correction of rounding, typing and sign errors with the deducorrect package. Technical Report 201119, Statistics Netherlands, The Hague.
13. R package version 3.0.0.
14. 'Amelia' package - Honaker, J., King, G. and Blackwell, M. (2010). Available at: http://cran.r-project.org/web/packages/Amelia/Amelia.pdf
15. 'editrules' package - Edwin de Jonge and Mark van der Loo (2013). Available at: http://cran.r-project.org/web/packages/editrules/index.html
16. 'survey' package - Thomas Lumley (2013). Available at: http://cran.r-project.org/web/packages/survey/index.html
17. 'VIM' package - Templ, M., Alfons, A. and Kowarik, A. (2010). Available at: http://cran.r-project.org/web/packages/VIM/vignettes/VIM-EU-SILC.pdf


The Bayesian Modelling of Inflation Rate in Romania

PhD Senior Researcher Mihaela SIMIONESCU (BRATU) ([email protected])
Institute for Economic Forecasting of the Romanian Academy

ABSTRACT

Bayesian econometrics has seen a considerable increase in popularity in recent years, attracting the interest of various groups of researchers in the economic sciences and beyond: specialists in econometrics, commerce, industry, marketing, finance, micro-economics, macro-economics and other domains. The purpose of this research is to provide an introduction to the Bayesian approach applied in economics, starting with Bayes' theorem. For Bayesian linear regression models the estimation methodology was presented, carrying out two empirical studies on data taken from the Romanian economy. Thus, an autoregressive model of order 2 and a multiple regression model were built for the index of consumer prices. The Gibbs sampling algorithm was used for estimation in the R software, computing the posterior means and the standard deviations. The parameters' stability proved to be greater than in the case of estimations based on the methods of classical econometrics.
Keywords: Bayesian econometrics, Bayesian regression, Bayes' theorem, Gibbs sampling algorithm, posterior mean
JEL Classification: C11, C13, C51

INTRODUCTION

Bayesian econometrics is a more recently developed branch of econometrics that applies Bayes' principle to economic modelling. An introduction to Bayesian inference in econometrics was made by Zellner (1996), first published in 1971, but in recent years a real informational explosion of research using Bayesian methods has been observed. This phenomenon can be explained with solid arguments, the first one being the real deficiencies of classical econometrics in estimating regression models based, in most cases, on assumptions that are unrealistic compared with the empirical data. On the other hand, the rapid progress of computational techniques has made it possible to apply the specific methods of Bayesian econometrics on a larger scale. Moreover, the real evolution of economic phenomena is better explained by the Bayesian approach, which works not only with data from samples but also with prior information, previously determined.


A presentation of the Bayesian methods for econometric estimation, testing and prediction of economic indicators was made by Zellner (1985), who showed the superiority of Bayesian methods, which include prior information, compared with classical econometric methods. Initially, a large part of the literature dedicated to Bayesian econometrics referred to factor models with an unknown number of latent factors. An introduction to modern Bayesian econometrics, highlighting the use of software for estimating Bayesian models and presenting in this context S programming and the use of the BUGS software, was made by Lancaster (2004). Most researchers use R and Matlab for Bayesian estimation; in this research I used some code written in R for estimating linear Bayesian regression models using the Gibbs sampling method. The Matlab program is used for Bayesian inference and estimation by Koop, Poirier and Tobias (2009), who offered many empirical examples for initiation into the secrets of Bayesian econometrics. A retrospective of the literature in this domain is made by Geweke, Koop and van Dijk (2011), discussing posterior simulation, Markov chain Monte Carlo (MCMC) methods, state space models, non-parametric techniques and filtering in the Bayesian approach. The authors highlight the applicability of Bayesian econometric techniques in numerous domains of the economic sciences, among them micro-economics, macro-economics, marketing, management, finance and commerce. For applications of the Bayesian approach not only in the economic sciences, but also in medicine, natural sciences, engineering, ecology, politics, industry and sociology, see O'Hagan and West (2010). The necessity of applying Bayesian econometric methods is driven by their ultimate purpose: taking decisions at different levels and in various domains. Approaches in both theoretical and practical terms are described by Heij, de Boer, Franses, Kloek and van Dijk (2004): discrete choice econometrics (truncated or censored data, logit and probit models, multinomial and ordered choices) and time series econometrics (univariate time series, vector-autoregressive models, simultaneous equations models, SUR models, panel data, trend, volatility). For different Monte Carlo approaches to the Bayesian analysis of simultaneous equations models, see van Dijk (2011). Bayesian inference for clusters based on Markov chains is made by Fruhwirth-Schnatter, Pamminger, Weber and Winter-Ebmer (2011). Recently, an efficient inference for ARMA models with switching regimes was proposed by Kim J. and Kim C. (2013).


The main critiques of the Bayesian approach, according to Gamerman and Lopes (2006), are related to the difficulty of computing the marginal likelihood and the normalization of the Bayes factor. The advantages of the Bayesian approach in economics, compared with classical modelling, are enumerated by Müller and Mitra (2013). Moreover, Bayesian methods allow the study of the properties of non-optimal estimators and statistics. The disadvantages of the ordinary least squares method of estimation in the classical linear regression were mentioned by Lindley and Smith (1972). The method cannot be applied for dimensions greater than 2; transformations of the variables are made in order to achieve this condition, and the hypothesis of normality is imposed on the errors, these assumptions making the estimation process difficult. On the other hand, the absence of these conditions in the Bayesian approach brings better results in many situations. The regression models used in forecasting need development and frequent updating, objectives that the classical regression model does not succeed in achieving. Therefore, Bayesian regression is a very good solution for extending low-volume data series. It is actually an adaptation of the famous Bayes theorem, taking into account two types of information: 1. prior information; 2. experimental data.

The purpose of this research is related to the following aspects: the explanation of the application of Bayes' theorem in economics, the presentation of the linear regression model estimated using the Bayesian approach, and the realization of two empirical examples using the Gibbs sampling algorithm, specific to Bayesian econometrics, for explaining the evolution of some macroeconomic indicators in Romania. This research is a novelty for the Romanian literature, Bayesian estimations of linear regression models never having been made before. The Bayesian approach proved to be better than the classical one, a probability distribution being determined for the coefficients. Thus, the decision process at the macroeconomic level is improved by considering not a point value of the estimators but a probability distribution, even for the variance. Knowledge of the errors' variance in the Bayesian approach offers the possibility of assessing and even reducing the uncertainty that affects the econometric model and the forecasts based on it.


BAYES’ THEOREM APPLIED IN ECONOMICS

If A and B are two events, the first one being unknown and the second one being known, in the Bayesian approach B is associated with the known data and A with the model coefficients. The following notations are used:
y – set of data
y* – set of unobserved data
Mi – set of models, where i = 1, 2, ..., m
θi – parameters that Mi depends on
p(θi/y) – posterior density
p(Mi/y) – posterior probability of the model (model choice is based on it)
p(y*/y) – predictive density (the forecast is based on it)

The conditional probability of A, when B is known, represents the probability that A takes place, when B has already taken place: pr(A/B) = pr(A,B)/pr(B).

According to Bayes' theorem: pr(A/B) = pr(B/A)·pr(A)/pr(B).

We consider only one regression model that depends on the parameters θ; according to Bayesian theory, pr(θ/y) = pr(y/θ)·pr(θ)/pr(y). If A is replaced by θ and B by y, then pr(θ/y) answers the question: the data y being known, what can we know about θ? There is a controversy between econometricians, most of them considering that θ is not a random variable. Since p(y) does not depend on θ, it can be ignored and the following approximation can be made for the posterior density, computed after the data are known:

pr(θ/y) ∝ pr(y/θ)·pr(θ)

pr(y/θ) – the likelihood function
pr(θ) – the prior density, which does not depend on the data


The marginal density is based on integration: p(y) = ∫ p(y, θ) dθ, being equivalent to

p(y) = ∫ p(y/θ)·p(θ) dθ   (1)

THE LINEAR REGRESSION MODEL IN THE BAYESIAN APPROACH. THE ESTIMATION ALGORITHM: GIBBS SAMPLING

The following regression model is considered in matrix form:

Y = X·A + ε, where ε ~ N(0, σ²·In)   (2)

The matrix Y of the dependent variable has dimension n×1, while X is the matrix of independent variables with size n×k, where k is the number of independent variables and n is the number of observations of each data series. The objective is the determination of the matrix of estimators, the errors' variance being σ². Classical econometrics solves this problem by estimating and maximising a likelihood function, resulting in the end in the estimator of matrix A and the estimated variance of the errors. So, classical econometrics is based on the use of all the data. Bayesian econometrics offers the following solution, which supposes the successive completion of the next steps:

1. The researcher intuits the values of the parameters' estimators, using information regarding the A matrix and the errors' variance, but this information is not related to the values of the data series for X and Y. These intuitions are called prior beliefs, being related to the researcher's experience and to previous studies for similar models but for other data sets. It is important to notice that these beliefs are expressed as probability distributions. For example, the prior on the matrix A of coefficients follows a normal distribution with average A0 and a variance-covariance matrix denoted by Σ0. The surer the researcher is about the assessments regarding the coefficients, the lower this variance is.

2. The second phase is also met in classical econometrics and it supposes the collection of the data for X and Y and the estimation of the likelihood function:

F(Y/A, σ²) = (2πσ²)^(-n/2) · exp( -(Y - XA)'(Y - XA) / (2σ²) )   (3)


3. The researcher updates the expectations regarding the model parameters using the data for X and Y and the estimated likelihood function. Practically, the prior probability distribution is combined with the likelihood function in order to get the posterior repartition. This distribution is defined in the terms of Bayes' theorem:

p(A, σ²/Y) = F(Y/A, σ²)·p(A, σ²) / F(Y)   (4)

In other words, the posterior distribution is obtained by dividing the product of the likelihood function and the prior probability by the marginal likelihood (the marginal density of the data, which is a scalar). So, the posterior distribution is proportional to the likelihood function times the prior. Therefore, in estimating the coefficients of the simple regression model the following relationship is used:

p(A, σ²/Y) ∝ F(Y/A, σ²)·p(A, σ²)   (5)

The joint density (the product of the marginal density of Y and the conditional density of the parameters, or the product of the marginal density of the parameters and the conditional density of the data) can be computed in two ways, being denoted by:

p(Y, A, σ²) = F(Y)·p(A, σ²/Y) = p(A, σ²)·F(Y/A, σ²)   (6)

Gibbs sampling is a numerical method used for estimation and it is applied in 3 possible cases:
• estimation of the posterior distribution of A under the hypothesis of known variance of the errors;
• estimation of the posterior distribution of the variance under the hypothesis of a known matrix A;
• estimation when both parameters are unknown.

a. Estimation of the posterior distribution of A under the hypothesis of known variance of the errors

The next steps are followed in this case:

a1. In practice, a prior normal distribution is chosen for A, which is a conjugate distribution. This is combined with the likelihood function and a posterior distribution results, the repartition being of the same type (the normal one). The form of the prior distribution, with average A0 and variance-covariance matrix denoted by Σ0, is:

p(A) ∝ exp( -(1/2)·(A - A0)' Σ0⁻¹ (A - A0) )   (7)

a2. The likelihood function is defined as:

F(Y/A, σ²) = (2πσ²)^(-n/2) · exp( -(Y - XA)'(Y - XA) / (2σ²) )   (8)

a3. The computation of the posterior distribution as the product of the prior and the likelihood:

p(A/σ², Y) ∝ p(A) × F(Y/A, σ²)

For the normal distribution, Hamilton (1994) and Koop (2003) used the following formulae for the normal distribution's average and variance:

M* = V* · (Σ0⁻¹·A0 + σ⁻²·X'Y)   (9)

V* = (Σ0⁻¹ + σ⁻²·X'X)⁻¹

b. The estimation of the posterior variance under the hypothesis of a known matrix A

b1. The normal distribution admits negative values, a fact that justifies the choice of an inverse Gamma distribution (or, equivalently, of a Gamma distribution for the precision parameter).

Let us consider a variable denoted by "v" with normal distribution and T numbers v1, ..., vT that are identically and independently distributed: vt ~ N(0, σ²).

The sum of squares of this variable, w = v1² + ... + vT², follows a Gamma distribution with parameters T (the number of degrees of freedom) and the scale parameter σ². The probability density function corresponding to this Gamma distribution is:

p(w) ∝ w^(T/2 - 1) · exp( -w / (2σ²) )   (10)

The mean of the Gamma distribution is given by:

E(w) = T·σ²   (11)


The form of the prior density for σ², an inverse Gamma with prior degrees of freedom T0 and prior scale θ0, is:

p(σ²) ∝ (σ²)^(-(T0/2) - 1) · exp( -θ0 / (2σ²) )   (12)

b2. The likelihood function is defined as:

F(Y/A, σ²) = (2πσ²)^(-T/2) · exp( -(Y - XA)'(Y - XA) / (2σ²) )   (13)

b3. The computation of the posterior distribution (a Gamma-type distribution, namely an inverse Gamma with T1 = T0 + T degrees of freedom and scale θ1 = θ0 + (Y - XA)'(Y - XA)) as:

p(σ²/A, Y) ∝ (σ²)^(-(T1/2) - 1) · exp( -θ1 / (2σ²) )

The mean of the conditional posterior distribution is:

θ1 / (T1 - 2) = (θ0 + (Y - XA)'(Y - XA)) / (T0 + T - 2)   (14)

c. We will consider the case when both parameters are unknown. The three steps are the following, according to Blake and Mumtaz (2012):
c1. the computation of the joint prior distribution;
c2. the likelihood function;
c3. the computation of the posterior distribution.

The joint posterior distribution for A and the variance is:

p(A, σ²/Y) ∝ F(Y/A, σ²)·p(A)·p(σ²)   (15)

The inference about the parameters is based on the computation of the conditional posterior distributions, defined by Koop (2003) as:

p(A/σ², Y) and p(σ²/A, Y)   (16)

Gibbs sampling is a numeric method for estimating the coefficients of the linear regression model that uses the conditional distributions to approximate the joint and the marginal repartitions. A general presentation of the method is followed by a description in the context of the linear regression model. A joint distribution of k variables is considered: f(x1, x2, ..., xk). Our objective is the determination of the marginal distributions f(x1), f(x2), ..., f(xk).


Starting from the conditional distributions f(xi / x1, ..., xi-1, xi+1, ..., xk), the Gibbs sampling algorithm approximates the marginal repartitions by following the next steps, an easier demarche than the integration of the joint distribution:
Step 1: the starting values x2(0), ..., xk(0) are considered;
Step 2: selection of a sample x1(1) from the distribution of x1 conditional on the current values x2(0), ..., xk(0);
Step 3: selection of a sample x2(1) from the distribution of x2 conditional on the current values x1(1), x3(0), ..., xk(0);
...
Step k: selection of a sample xk(1) from the distribution of xk conditional on the current values x1(1), ..., xk-1(1).

According to Casella and George (1992), as the number of iterations goes to infinity, the draws from the conditional distributions (the samples) converge to the marginal and joint repartitions of xi at an exponential rate. So, after a large number of steps, we can easily approximate the marginal distribution by the empirical repartition of xi. If the Gibbs algorithm is applied P times and only the last M draws of xi are retained, the histogram of these values is an approximation of the marginal density of xi. Therefore, the estimator for the average of the marginal posterior repartition of xi is the mean of the retained draws, (1/M)·Σb xi(b), where b indexes the retained Gibbs iterations. The variance of the marginal distribution is approximated in the same way from the retained draws; an important practical issue is the number of Gibbs iterations that are necessary for convergence. The form of the conditional distributions should be known a priori by the researcher. Moreover, random draws can be taken from these distributions.


GIBBS SAMPLING ALGORITHM FOR LINEAR REGRESSION. AN APPLICATION FOR THE INFLATION RATE IN ROMANIA

Let us consider the following regression model (an AR(2) model) for yt, the monthly index of consumer prices for Romania, used in computing the inflation rate, over the period January 1991 – April 2013:

yt = c + a1·yt-1 + a2·yt-2 + εt, εt ~ N(0, σ²)   (17)

The RHS variables are: 1, yt-1 and yt-2; A = (c, a1, a2)' is the vector of coefficients.
Objective: the approximation of the marginal distributions of the coefficients and of the variance σ².

The priors and initial values are set. A normal distribution is set for the coefficients. This implies that the prior averages are specified for each coefficient, A0 being a 3×1 vector, together with the prior variance Σ0:

p(A) ~ N(A0, Σ0)   (18)

An inverse Gamma distribution is set as prior for σ², with prior degrees of freedom T0 and prior scale θ0; we will work with the inverse Gamma distribution:

p(σ²) ~ IG(T0/2, θ0/2)   (19)

The OLS estimator of σ² is set as the starting value. The large number of Gibbs iterations makes the influence of the starting value on the results insignificant for linear regressions. Then, we sample from the conditional posterior repartition of A, given the starting value for σ²:


A/σ², Y ~ N(M*, V*)

M* = (Σ0⁻¹ + σ⁻²·X'X)⁻¹ · (Σ0⁻¹·A0 + σ⁻²·X'Y)   (20)

V* = (Σ0⁻¹ + σ⁻²·X'X)⁻¹

The following algorithm is used to compute the draw for A. Algorithm a: Let z be a k×1 vector to be sampled from a normal distribution with mean m and variance v. Let z₀ be a k×1 vector of draws from the standard normal distribution; these numbers can be transformed so that they have mean m and variance v: z = m + chol(v)'·z₀, where chol(v) denotes the Cholesky factor of v. In our case the relationship is A^(1) = M* + chol(V*)'·z₀, where z₀ is a vector drawn from the standard normal distribution. The variance σ² is then drawn from its conditional posterior distribution, A being given; this distribution is an inverse Gamma one:
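A minimal R sketch of algorithm a, using the hypothetical object names M_star and V_star for M* and V* (both assumed to have been computed as in equation (20)):

# Draw the coefficient vector from N(M_star, V_star) via the Cholesky factor;
# M_star (k x 1) and V_star (k x k) are assumed to be available.
k <- length(M_star)
z0 <- rnorm(k)                           # k draws from the standard normal
A_draw <- M_star + t(chol(V_star)) %*% z0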

σ² | A, y ~ IG(T₁/2, θ₁/2)   (21)

T₁ = T₀ + T,   θ₁ = θ₀ + (y − XA^(1))'(y − XA^(1))   (22)

The following algorithm is used in order to sample a scalar denoted by z from the inverse Gamma distribution with T/2 degrees of freedom and scale parameter D/2. Algorithm b: We generate T numbers z₀ = (z₀₁, ..., z₀T) from the standard normal distribution N(0, 1). Then z is computed as z = D / (z₀'z₀), z being thereby a draw from the inverse Gamma distribution. After getting A¹, ..., Aᴾ, the last M values of A are used to form the empirical distribution of the parameters (an approximation of the marginal posterior distribution). The first P − M iterations, which are not taken into consideration, are called burn-in iterations.
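A minimal R sketch of algorithm b, with T1 and theta1 standing for the posterior degrees of freedom T₁ and scale θ₁ from equations (21)–(22) (both assumed to be already computed):

# Draw sigma^2 from an inverse Gamma distribution with T1/2 degrees of freedom
# and scale theta1/2: divide the scale by a chi-squared value built from normals.
z0 <- rnorm(T1)                      # T1 standard normal numbers
sigma2_draw <- theta1 / sum(z0^2)    # one draw from IG(T1/2, theta1/2)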


For estimating the linear regression in R we could use the function proposed by Professor Doug Schroeder, which is available at http://fisher.osu.edu/~schroeder.9/AMIS900/GibbsLinRegr.R. The prior mean for beta is represented by "prior", together with the prior variance for beta; nu0 gives the degrees of freedom for sigmasq and, for sigmasq, the prior scale is also specified. The output from the Gibbs sampler is used to make inference. A sequence of draws from the approximate marginal distribution of the parameters is obtained. The average of the draws is an approximation of the posterior mean, providing a point estimate for the parameters. Percentiles computed from these draws are used to obtain posterior density intervals; the 5th and the 95th percentiles approximate a 90% highest posterior density interval. The marginal likelihood, defined as p(y) = ∫ p(y | A, σ²)·p(A, σ²) dA dσ², is the density of the data with the parameters integrated out. A model M1 is preferred to M2 if p(y | M1) > p(y | M2), or equivalently if the Bayes factor p(y | M1)/p(y | M2) is larger than one.
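For illustration only, the core of such a Gibbs sampler can be sketched directly from equations (17)–(22). This is not the GibbsLinRegr.R function mentioned above; the data object cpi (a numeric vector with the monthly CPI index) and the concrete prior values A0, Sigma0, T0 and theta0 are assumptions made for this sketch.

# Sketch of a Gibbs sampler for the AR(2) regression y_t = c + b1*y_{t-1} + b2*y_{t-2} + e_t.
set.seed(1)
y  <- cpi[3:length(cpi)]                                   # 'cpi' is assumed to exist
X  <- cbind(1, cpi[2:(length(cpi) - 1)], cpi[1:(length(cpi) - 2)])
Tn <- length(y); k <- ncol(X)

A0     <- rep(0, k)          # assumed prior mean
Sigma0 <- diag(k)            # prior variance: identity matrix
T0     <- 1; theta0 <- 0.1   # assumed prior degrees of freedom and scale

P <- 50000; M <- 10000                        # total and retained iterations
sigma2 <- summary(lm(y ~ X - 1))$sigma^2      # starting value: OLS estimate
draws  <- matrix(NA, M, k + 1)

for (i in 1:P) {
  # (20): conditional posterior of A given sigma^2
  Vstar <- solve(solve(Sigma0) + crossprod(X) / sigma2)
  Mstar <- Vstar %*% (solve(Sigma0) %*% A0 + crossprod(X, y) / sigma2)
  A     <- Mstar + t(chol(Vstar)) %*% rnorm(k)             # algorithm a
  # (21)-(22): conditional posterior of sigma^2 given A
  theta1 <- theta0 + crossprod(y - X %*% A)
  sigma2 <- as.numeric(theta1 / sum(rnorm(T0 + Tn)^2))     # algorithm b
  if (i > P - M) draws[i - (P - M), ] <- c(A, sigma2)
}
colMeans(draws)        # posterior means of (c, b1, b2, sigma^2)
apply(draws, 2, sd)    # posterior standard deviations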

The prior mean for the coefficients is the 3x1 vector A₀, while the prior variance Σ₀ is the identity matrix. 10,000 replications were saved for this application, the total number of iterations being 50,000. Each coefficient has a posterior mean and a standard deviation (Table no. 1).


The coefficients, the posterior means and the standard deviations
Table no. 1

Coefficient    Posterior mean    Standard deviation
               -0.0391           0.099883
                0.0063           0.010486
                0.0611           0.09542

Source: own computations

The posterior means are rather low, while the standard deviations are below 0.1, which suggests the stability of the parameters; the model performs better than the classical autoregressive one.

CONCLUSIONS

The Bayesian approach, which has seen major development over the last 20 years, finds its applicability in many domains of economics, being a support for decision making under various conditions. Bayesian econometrics has many different applications, the Bayesian linear regression model offering another perspective on modelling the dependences between variables. The inclusion of prior information leads to better estimates of the parameters, a situation also reflected by the low values of the coefficients compared to the models from classical econometrics. Future research should consider the selection of the best prior values of the parameters based on the results provided by classical econometrics.

References

1. Blake, A. and Mumtaz, H., 2012. Applied Bayesian econometrics for central banks, Technical book, Centre for Central Banking Studies, Bank of England, London

2. Casella, G. and George, E. I., 1992. Explaining the Gibbs Sampler. The American Statistician, 46(3), pp. 167-174.

3. Fruhwirth-Schnatter, S., Pamminger, C., Weber, A. and Winter-Ebmer, R., 2011. Labor market entry and earnings dynamics: Bayesian inference using mixtures-of-experts Markov chain clustering. Journal of Applied Econometrics, 26, pp. 86-95

4. Lindley, D. V. and Smith, A. F. M., 1972. Bayes Estimates for the Linear Model, Journal of the Royal Statistical Society. Series B (Methodological), 34(1), pp. 1-41


5. Gamerman, D. and Lopes, H.F., 2006. MCMC - Stochastic Simulation for Bayesian Inference. Chapman & Hall/CRC, USA

6. Gogonea, R.M., 2009. Statistica - bază teoretico-aplicativă pentru comerţ - servicii - turism, Editura Universitară, Bucureşti

7. Geweke, J., Koop, G. and van Dijk, H., 2011. The Oxford Handbook of Bayesian Econometrics, Series: Oxford Handbooks in Economics, Oxford University Press, London

8. Heij, C., de Boer, P., Franses, P.H., Kloek, T. and van Dijk, H.K., 2004. Econometric Methods with Applications in Business and Economics, Oxford University Press, London

9. Kim, J. and Kim, C., 2013. An Efficient Bayesian Inference of Regime-Switching ARMA Models: Dynamics of Ex-Ante Real Interest Rate Under Regime Shifts. In: University of Washington, Seminar on Bayesian Inference in Econometrics and Statistics (SBIES), St. Louis, USA, 3-4 May 2013, Washington University.

10. Koop, G., Poirier, D. J. and Tobias, J. L., 2007. Bayesian Econometric Methods, Cambridge University Press, London.

11. Lancaster, T., 2004. An Introduction to Modern Bayesian Econometrics, Blackwell Publishing, London.

12. Müller, P. and Mitra, R., 2013. Bayesian Nonparametric Inference – Why and How. Bayesian Analysis, 8(2), pp. 269-302

13. O'Hagan, A. and West, M., 2010. The Oxford Handbook of Applied Bayesian Analysis. Oxford University Press, London

14. Smeureanu, I. and Ruxanda, G., 2013. Consideraţii privind abordarea stochastică în domeniul economic. Amfiteatru Economic, vol. XV, no. 34/2013.

15. van Dijk, H.K., 2011. Direct and Indirect Monte Carlo Approaches for Bayesian Analysis of the Simultaneous Equations and Instrumental Variables Models: A Synthesis. In: University of Washington, Seminar on Bayesian Inference in Econometrics and Statistics (SBIES), St. Louis, USA, 27-28 April 2013, Washington University

16. Zellner, A., 1996. An Introduction to Bayesian Inference in Econometrics, Wiley Publishing House, London


Estimation procedure in Monthly retail trade survey in Serbia using R software
Sofija SUVOCAREV (sofi [email protected])
Statistical Office of the Republic of Serbia

ABSTRACT

The objective of the Monthly retail trade survey (MRTS), based on a sample and on the VAT reports received from the Tax Administration, is to provide data on the turnover of goods in retail trade in order to measure monthly changes in turnover. Indices, totals and standard errors are calculated for the territory of the Republic of Serbia and for the territorial units (NUTS 2). For the Republic of Serbia, these parameters are also calculated by two classes and eight groups of NACE Rev. 2. The calculation is based on a stratified simple random sample. This paper shows how the estimation procedure for these parameters is implemented in R software. Keywords: Retail trade, R software, parameter estimation.

INTRODUCTION TO THE SURVEY

The aim of the survey is to provide data on turnover in order to measure monthly changes in turnover. Indices are calculated for the territory of the Republic of Serbia and for the territorial units (NUTS 2), for the current month relative to the previous month. A new requirement in 2013 is to calculate indices for the Republic of Serbia by the two classes and eight groups of NACE Rev. 2 which belong to division 47. They are shown in Table 1.


Two classes and 8 groups of NACE Rev. 2 which belong to division 47
Table no. 1

4711 Retail sale in non-specialized stores with food, beverages or tobacco predominating
4719 Other retail sale in non-specialized stores
472  Retail sale of food, beverages and tobacco in specialized stores
473  Retail sale of automotive fuel in specialized stores
474  Retail sale of ICT equipment in specialized stores
475  Retail sale of other household equipment in specialized stores
476  Retail sale of cultural and recreation goods in specialized stores
477  Retail sale of other goods in specialized stores
478  Retail sale via stalls and markets
479  Retail trade not in stores, stalls or markets

POPULATION AND FRAME

The basic set of units for MRTS is created according to the data of the Statistical business register (SBR), in January of the current year. Units are all active enterprises from division 47 and, in addition, 44 enterprises whose main activity is not retail trade but which are also engaged in retail trade activity. The basic set consists of 5 parts, indicated by the auxiliary variable DEO:
- small, medium and large enterprises according to the financial report, DEO = 1, 2 and 3, respectively;
- budget enterprises, DEO = B;
- enterprises included on purpose, whose main NACE Rev. 2 activity is not in division 47, DEO = E.

The final frame for the year 2013 consists of 4641 units:
- all budget enterprises and those included on purpose;
- all medium and large enterprises with turnover > 0;
- small enterprises with turnover >= 1300 thousand RSD.

Table 2 shows the fraction of units, turnover and number of employees by parts of the 2013 frame, according to the variable DEO.


Fraction of units, turnover and number of employees by parts of the frame 2013
Table no. 2

Description of the DEO                               DEO   No. of units (%)   Turnover (%)   No. of employees (%)
Total                                                      74,9               100,0          98,5
Small enterprises                                    1     73,7               99,7           93,8
Medium enterprises                                   2     99,3               100,0          99,8
Large enterprises                                    3     100,0              100,0          100,0
Budget enterprises                                   B     100,0              -              100,0
Enterprises that are not in NACE Rev. 2 division 47  E     100,0              100,0          100,0

STRATIFICATION AND ALLOCATION

Stratification of the frame units, according to the part of the frame to which they belong, is into five classes (defined by the values of the variable DEO): 1, 2, 3, B and E.

Further stratification of the parts DEO = 1, 2, 3, B is according to:
- size: into smaller (cens_m = 0) and larger (cens_m = 1) units;
- NACE Rev. 2 activity.

Census strata in the MRTS are those for which:
- DEO = 3, E, or
- DEO = 1, 2, B and cens_m = 1.

Allocation is carried out by applying the Bethel algorithm. The total number of strata is 56, and 39 of them are census strata.

ESTIMATION PROCEDURE

The Horvitz-Thompson estimates of totals and indices are calculated pursuant to the standard procedure for a stratified sample with simple random selection of units within each stratum. The R survey package is used for the estimation procedure. The main concepts and parts of the R code are presented in this chapter, and the complete code is given in the Annex.
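The core idea can be illustrated on a small invented data set; the column names below follow the variables used later in the Annex (mstrat, mnh, ppromet, ipromet), but the values are purely illustrative:

# Minimal sketch of Horvitz-Thompson estimation for a stratified SRS with the
# survey package; the toy data frame and its values are invented for illustration.
library(survey)
toy <- data.frame(
  mstrat  = rep(c("S1", "S2"), each = 4),      # stratum ID
  mnh     = rep(c(10, 20), each = 4),          # stratum size in the frame
  ppromet = c(12, 15, 9, 11, 30, 28, 35, 32),  # turnover, previous month
  ipromet = c(13, 14, 10, 12, 31, 30, 36, 33)  # turnover, current month
)
d <- svydesign(id = ~1, strata = ~mstrat, fpc = ~mnh, data = toy)
svytotal(~ipromet, d)                # Horvitz-Thompson total with its SE
svyratio(~ipromet, ~ppromet, d)      # index (ratio) with Taylor-linearized SE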


The realized sample (realiz_uzorak) contains the following variables:
- mb – statistical unit ID
- malo – NACE Rev. 2 class
- vs – indicator of the type of turnover, taking six different values (1 = total turnover of the enterprise, 2 = turnover of the enterprise based on retail trade, 3, 4, 5 and 6 = turnover of the enterprise based on retail trade by territorial units (NUTS 2))
- ppromet – turnover in the previous month
- ipromet – turnover in the current month
- ppdv – VAT in the previous month
- ipdv – VAT in the current month
- pprometb – turnover in the previous month without VAT
- iprometb – turnover in the current month without VAT

The selected sample (plan_uzorak) contains the following variables:
- mb – statistical unit ID
- mstrat – stratum ID
- mnh – size of the stratum
- mpnh – number of statistical units allocated in the stratum
- odziv – response information for statistical units (1 = unit has reported turnover and belongs to the realized sample, 2 = refused to fill in the questionnaire, 3 = not found at the address, 4 = closed, 5 = stationary / not operating, 6 = activity not in the scope of the survey, 7 = in bankruptcy, 8 = in liquidation and 9 = other, state the reason)

If the variable odziv takes one of the values 1, 5, 6 or 7 for a certain statistical unit, then that statistical unit is considered part of the realized sample.

The R code for the estimation procedure can be described in a few steps. First of all, the necessary packages are loaded using the function library(); the realized and selected samples, which are given in Excel files, are read using the function readWorksheetFromFile(). After merging these two files with merge() and creating the new variable malo1, which is needed for domain estimation, the path of the Excel file into which the results are going to be exported is defined with file.path(). Subsets of the realized sample according to the type of turnover are taken with subset(); for each subset a sampling design is defined with svydesign() and the weights for the defined sampling design are calculated with weights(). The R code further produces both unweighted and weighted totals for the turnover in the previous month and in the current month, as well as the indices. Note that the turnover also includes VAT. For these estimated parameters, standard errors, coefficients


of variation and confidence intervals are also calculated. At the end, all these parameters, standard errors, coefficients of variation and confidence intervals are calculated for the Republic of Serbia by the two classes and eight groups of NACE Rev. 2 which belong to division 47. For a description of the functions for importing Excel files into R and exporting from R to Excel files, see [4]. For descriptions of the functions for handling complex sampling designs, see [3]. For descriptions of these functions together with an insight into the corresponding theory, see [1]. For a description of the functions that handle data frames and other R objects, see [5].

CONCLUSION

The R survey package offers very effective ways of implementing different estimation procedures in the MRTS. I have chosen R functions that give Horvitz-Thompson estimates for totals and indices, and an R function that gives the estimate of the variance. In the case of non-linear parameters (indices), the function uses the Taylor linearization method for variance estimation. Since the variance is one of the key quality indicators in sample surveys and helps the user to draw better conclusions about the statistics produced, I want to emphasize that other methods for variance estimation are also implemented in the survey package, such as Balanced Repeated Replication (BRR), Fay's method, the jackknife and the bootstrap method.
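As a hedged illustration of these alternatives (not part of the production code), a design such as dstrat2 from the Annex can be converted into a replicate-weight design, after which the same estimation calls use bootstrap replication for the variance:

# Convert the stratified design to a bootstrap replicate-weight design;
# 'dstrat2' is assumed to exist exactly as created in the Annex.
drep <- as.svrepdesign(dstrat2, type = "bootstrap", replicates = 200)
svytotal(~ipromet, drep)             # total with bootstrap standard error
svyratio(~ipromet, ~ppromet, drep)   # index with bootstrap standard error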


ANNEX. COMPLETE CODE FOR ESTIMATION PROCEDURE

>library(survey)
>library(XLConnect)

>plan_uzorak<-file.path("C:/Documents and Settings/sofija/Desktop/puz10_2013_april.xls")
>plan_uzorak<-readWorksheetFromFile(plan_uzorak, sheet="puz10_2013")
>realiz_uzorak<-file.path("C:/Documents and Settings/sofija/Desktop/TRG10N_za_pondere_oktobar.xls")
>realiz_uzorak<-readWorksheetFromFile(realiz_uzorak, sheet="Sheet1")

>uparena<-merge(plan_uzorak,realiz_uzorak,by="mb")
>malo<-as.vector(uparena$malo)
>for(i in 1:length(malo)){ if(malo[i]==4711 || malo[i]==4719) malo[i]<-malo[i] else malo[i]<-substr(malo[i],1,3)}
>uparena$malo1<-malo

>pomoc<-file.path("C:/Documents and Settings/sofija/Desktop/pom.xlsx")

>writeWorksheetToFile(pomoc,uparena,sheet='upar')

>slog1<-subset(uparena,vs==1)
>slog2<-subset(uparena,vs==2)
>slog3<-subset(uparena,vs==3)
>slog4<-subset(uparena,vs==4)
>slog5<-subset(uparena,vs==5)
>slog6<-subset(uparena,vs==6)

>options(survey.lonely.psu="remove")

>dstrat1<-svydesign(id=~1,strata=~mstrat, data=slog1, fpc=~mnh)
>dstrat2<-svydesign(id=~1,strata=~mstrat, data=slog2, fpc=~mnh)
>dstrat3<-svydesign(id=~1,strata=~mstrat, data=slog3, fpc=~mnh)
>dstrat4<-svydesign(id=~1,strata=~mstrat, data=slog4, fpc=~mnh)
>dstrat5<-svydesign(id=~1,strata=~mstrat, data=slog5, fpc=~mnh)
>dstrat6<-svydesign(id=~1,strata=~mstrat, data=slog6, fpc=~mnh)


>ponderi<-as.vector(weights(dstrat1))
>tabela_pon<-data.frame(slog1$mb,ponderi,slog1$rbr.x,slog1$mbops,slog1$ppromet,slog2$ppromet,slog3$ppromet,slog4$ppromet,slog5$ppromet,slog6$ppromet,slog1$ipromet,slog2$ipromet,slog3$ipromet,slog4$ipromet,slog5$ipromet,slog6$ipromet,slog1$indeks_i_na_p,slog2$indeks_i_na_p,slog3$indeks_i_na_p,slog4$indeks_i_na_p,slog5$indeks_i_na_p,slog6$indeks_i_na_p,slog1$ppdv,slog2$ppdv,slog3$ppdv,slog4$ppdv,slog5$ppdv,slog6$ppdv,slog1$ipdv,slog2$ipdv,slog3$ipdv,slog4$ipdv,slog5$ipdv,slog6$ipdv,slog1$mstrat)
>writeWorksheetToFile(pomoc,tabela_pon,sheet='tab_pon')

>unw_ppromet<-c(sum(slog1$ppromet),sum(slog2$ppromet),sum(slog3$ppromet),sum(slog4$ppromet),sum(slog5$ppromet),sum(slog6$ppromet))
>unw_ipromet<-c(sum(slog1$ipromet),sum(slog2$ipromet),sum(slog3$ipromet),sum(slog4$ipromet),sum(slog5$ipromet),sum(slog6$ipromet))
>unw_indeks_i_na_p<-c(sum(slog1$ipromet)/sum(slog1$ppromet),sum(slog2$ipromet)/sum(slog2$ppromet),sum(slog3$ipromet)/sum(slog3$ppromet),sum(slog4$ipromet)/sum(slog4$ppromet),sum(slog5$ipromet)/sum(slog5$ppromet),sum(slog6$ipromet)/sum(slog6$ppromet))*100
>vs<-c(1,2,3,4,5,6)
>naziv_vs<-c("Ukupan promet","Promet trg. na malo","Srbija-sever","Beogradski region","Region Vojvodine","Srbija jug")
>unw_promet<-data.frame(vs,naziv_vs,unw_ppromet,unw_ipromet,unw_indeks_i_na_p)
>writeWorksheetToFile(pomoc,unw_promet,sheet='unw_prom')

#ESTIMATES WITH VAT

>w_ppromet<-c(round(svytotal(~slog1$ppromet,dstrat1),2),round(svytotal(~slog2$ppromet,dstrat2),2),round(svytotal(~slog3$ppromet,dstrat3),2),round(svytotal(~slog4$ppromet,dstrat4),2),round(svytotal(~slog5$ppromet,dstrat5),2),round(svytotal(~slog6$ppromet,dstrat6),2))


>w_ipromet<-c(round(svytotal(~slog1$ipromet,dstrat1),2),round(svytotal(~slog2$ipromet,dstrat2),2),round(svytotal(~slog3$ipromet,dstrat3),2),round(svytotal(~slog4$ipromet,dstrat4),2),round(svytotal(~slog5$ipromet,dstrat5),2),round(svytotal(~slog6$ipromet,dstrat6),2))
>w_indeks_i_na_p<-round(c(as.numeric(as.vector(svyratio(~slog1$ipromet,~slog1$ppromet,dstrat1)))[1],as.numeric(as.vector(svyratio(~slog2$ipromet,~slog2$ppromet,dstrat2)))[1],as.numeric(as.vector(svyratio(~slog3$ipromet,~slog3$ppromet,dstrat3)))[1],as.numeric(as.vector(svyratio(~slog4$ipromet,~slog4$ppromet,dstrat4)))[1],as.numeric(as.vector(svyratio(~slog5$ipromet,~slog5$ppromet,dstrat5)))[1],as.numeric(as.vector(svyratio(~slog6$ipromet,~slog6$ppromet,dstrat6)))[1])*100,2)

>se_w_ppromet<-c(as.data.frame(svytotal(~slog1$ppromet,dstrat1))[,2],as.data.frame(svytotal(~slog2$ppromet,dstrat2))[,2],as.data.frame(svytotal(~slog3$ppromet,dstrat3))[,2],as.data.frame(svytotal(~slog4$ppromet,dstrat4))[,2],as.data.frame(svytotal(~slog5$ppromet,dstrat5))[,2],as.data.frame(svytotal(~slog6$ppromet,dstrat6))[,2])
>se_w_ipromet<-c(as.data.frame(svytotal(~slog1$ipromet,dstrat1))[,2],as.data.frame(svytotal(~slog2$ipromet,dstrat2))[,2],as.data.frame(svytotal(~slog3$ipromet,dstrat3))[,2],as.data.frame(svytotal(~slog4$ipromet,dstrat4))[,2],as.data.frame(svytotal(~slog5$ipromet,dstrat5))[,2],as.data.frame(svytotal(~slog6$ipromet,dstrat6))[,2])


>se_w_indeks_i_na_p<-c(round(sqrt(as.numeric((as.vector(svyratio(~slog1$ipromet,~slog1$ppromet,dstrat1))[2])))*100,2),round(sqrt(as.numeric((as.vector(svyratio(~slog2$ipromet,~slog2$ppromet,dstrat2))[2])))*100,2),round(sqrt(as.numeric((as.vector(svyratio(~slog3$ipromet,~slog3$ppromet,dstrat3))[2])))*100,2),round(sqrt(as.numeric((as.vector(svyratio(~slog4$ipromet,~slog4$ppromet,dstrat4))[2])))*100,2),round(sqrt(as.numeric((as.vector(svyratio(~slog5$ipromet,~slog5$ppromet,dstrat5))[2])))*100,2),round(sqrt(as.numeric((as.vector(svyratio(~slog6$ipromet,~slog6$ppromet,dstrat6))[2])))*100,2))

>cv_w_ppromet<-round(c(as.numeric(cv(svytotal(~slog1$ppromet,dstrat1))),as.numeric(cv(svytotal(~slog2$ppromet,dstrat2))),as.numeric(cv(svytotal(~slog3$ppromet,dstrat3))),as.numeric(cv(svytotal(~slog4$ppromet,dstrat4))),as.numeric(cv(svytotal(~slog5$ppromet,dstrat5))),as.numeric(cv(svytotal(~slog6$ppromet,dstrat6))))*100,2)
>cv_w_ipromet<-round(c(as.numeric(cv(svytotal(~slog1$ipromet,dstrat1))),as.numeric(cv(svytotal(~slog2$ipromet,dstrat2))),as.numeric(cv(svytotal(~slog3$ipromet,dstrat3))),as.numeric(cv(svytotal(~slog4$ipromet,dstrat4))),as.numeric(cv(svytotal(~slog5$ipromet,dstrat5))),as.numeric(cv(svytotal(~slog6$ipromet,dstrat6))))*100,2)
>cv_w_indeks_i_na_p<-round(c(as.numeric(cv(svyratio(~slog1$ipromet,~slog1$ppromet,dstrat1))),as.numeric(cv(svyratio(~slog2$ipromet,~slog2$ppromet,dstrat2))),as.numeric(cv(svyratio(~slog3$ipromet,~slog3$ppromet,dstrat3))),as.numeric(cv(svyratio(~slog4$ipromet,~slog4$ppromet,dstrat4))),as.numeric(cv(svyratio(~slog5$ipromet,~slog5$ppromet,dstrat5))),as.numeric(cv(svyratio(~slog6$ipromet,~slog6$ppromet,dstrat6))))*100,2)

#CONFIDENCE INTERVALS

>ci1_ppromet<-c(confint(svytotal(~slog1$ppromet,dstrat1)))
>ci2_ppromet<-c(confint(svytotal(~slog2$ppromet,dstrat2)))
>ci3_ppromet<-c(confint(svytotal(~slog3$ppromet,dstrat3)))
>ci4_ppromet<-c(confint(svytotal(~slog4$ppromet,dstrat4)))
>ci5_ppromet<-c(confint(svytotal(~slog5$ppromet,dstrat5)))
>ci6_ppromet<-c(confint(svytotal(~slog6$ppromet,dstrat6)))

>dci_ppromet<-c(ci1_ppromet[1],ci2_ppromet[1],ci3_ppromet[1],ci4_ppromet[1],ci5_ppromet[1],ci6_ppromet[1])
>gci_ppromet<-c(ci1_ppromet[2],ci2_ppromet[2],ci3_ppromet[2],ci4_ppromet[2],ci5_ppromet[2],ci6_ppromet[2])


>ci1_ipromet<-c(confint(svytotal(~slog1$ipromet,dstrat1)))
>ci2_ipromet<-c(confint(svytotal(~slog2$ipromet,dstrat2)))
>ci3_ipromet<-c(confint(svytotal(~slog3$ipromet,dstrat3)))
>ci4_ipromet<-c(confint(svytotal(~slog4$ipromet,dstrat4)))
>ci5_ipromet<-c(confint(svytotal(~slog5$ipromet,dstrat5)))
>ci6_ipromet<-c(confint(svytotal(~slog6$ipromet,dstrat6)))

>dci_ipromet<-c(ci1_ipromet[1],ci2_ipromet[1],ci3_ipromet[1],ci4_ipromet[1],ci5_ipromet[1],ci6_ipromet[1])
>gci_ipromet<-c(ci1_ipromet[2],ci2_ipromet[2],ci3_ipromet[2],ci4_ipromet[2],ci5_ipromet[2],ci6_ipromet[2])

>ci1_indeks_i_na_p<-c(confint(svyratio(~slog1$ipromet,~slog1$ppromet,dstrat1)))
>ci2_indeks_i_na_p<-c(confint(svyratio(~slog2$ipromet,~slog2$ppromet,dstrat2)))
>ci3_indeks_i_na_p<-c(confint(svyratio(~slog3$ipromet,~slog3$ppromet,dstrat3)))
>ci4_indeks_i_na_p<-c(confint(svyratio(~slog4$ipromet,~slog4$ppromet,dstrat4)))
>ci5_indeks_i_na_p<-c(confint(svyratio(~slog5$ipromet,~slog5$ppromet,dstrat5)))
>ci6_indeks_i_na_p<-c(confint(svyratio(~slog6$ipromet,~slog6$ppromet,dstrat6)))

>dci_indeks_i_na_p<-c(ci1_indeks_i_na_p[1],ci2_indeks_i_na_p[1],ci3_indeks_i_na_p[1],ci4_indeks_i_na_p[1],ci5_indeks_i_na_p[1],ci6_indeks_i_na_p[1])*100
>gci_indeks_i_na_p<-c(ci1_indeks_i_na_p[2],ci2_indeks_i_na_p[2],ci3_indeks_i_na_p[2],ci4_indeks_i_na_p[2],ci5_indeks_i_na_p[2],ci6_indeks_i_na_p[2])*100

# DOMAIN ESTIMATES

>d_ppromet<-as.vector(svyby(~ppromet,~malo1, dstrat2, svytotal, keep.var=TRUE))[,2]
>d_ipromet<-as.vector(svyby(~ipromet,~malo1, dstrat2, svytotal, keep.var=TRUE))[,2]
>d_indeks_i_na_p<-round(as.vector(svyby(~ipromet, by=~malo1, denominator=~ppromet, design=dstrat2, svyratio))[,2]*100,2)

>d_se_ppromet<-as.vector(svyby(~ppromet,~malo1, dstrat2, svytotal, keep.var=TRUE))[,3]
>d_se_ipromet<-as.vector(svyby(~ipromet,~malo1, dstrat2, svytotal, keep.var=TRUE))[,3]


>d_se_indeks_i_na_p<-round(as.vector(svyby(~ipromet, by=~malo1, denominator=~ppromet, design=dstrat2, svyratio))[,3]*100,2)

>d_cv_ppromet<-round(as.vector(cv(svyby(~ppromet,~malo1, dstrat2, svytotal, keep.var=TRUE)))*100,2)
>d_cv_ipromet<-round(as.vector(cv(svyby(~ipromet,~malo1, dstrat2, svytotal, keep.var=TRUE)))*100,2)
>d_cv_indeks_i_na_p<-round(as.vector(cv(svyby(~ipromet, by=~malo1, denominator=~ppromet, design=dstrat2, svyratio)))*100,2)

#CONFIDENCE INTERVALS

>d_ci_ppromet<-c(confint(svyby(~ppromet, ~malo1, dstrat2, svytotal, keep.var=TRUE)))
>d_ci_ipromet<-c(confint(svyby(~ipromet, ~malo1, dstrat2, svytotal, keep.var=TRUE)))
>d_ci_indeks_i_na_p<-c(confint(svyby(~ipromet, by=~malo1, denominator=~ppromet, design=dstrat2, svyratio)))

>dd_ci_ppromet<-d_ci_ppromet[1:(length(d_ci_ppromet)/2)]
>gd_ci_ppromet<-d_ci_ppromet[((length(d_ci_ppromet)/2)+1):length(d_ci_ppromet)]

>dd_ci_ipromet<-d_ci_ipromet[1:(length(d_ci_ipromet)/2)]
>gd_ci_ipromet<-d_ci_ipromet[((length(d_ci_ipromet)/2)+1):length(d_ci_ipromet)]

>dd_ci_indeks_i_na_p<-d_ci_indeks_i_na_p[1:(length(d_ci_indeks_i_na_p)/2)]*100
>gd_ci_indeks_i_na_p<-d_ci_indeks_i_na_p[((length(d_ci_indeks_i_na_p)/2)+1):length(d_ci_indeks_i_na_p)]*100

>d_ci_pprometb<-c(confint(svyby(~pprometb, ~malo1, dstrat2, svytotal, keep.var=TRUE)))
>d_ci_iprometb<-c(confint(svyby(~iprometb, ~malo1, dstrat2, svytotal, keep.var=TRUE)))
>d_ci_indeks_i_na_pb<-c(confint(svyby(~iprometb, by=~malo1, denominator=~pprometb, design=dstrat2, svyratio)))

>dd_ci_pprometb<-d_ci_pprometb[1:(length(d_ci_pprometb)/2)]
>gd_ci_pprometb<-d_ci_pprometb[((length(d_ci_pprometb)/2)+1):length(d_ci_pprometb)]


>dd_ci_iprometb<-d_ci_iprometb[1:(length(d_ci_iprometb)/2)]
>gd_ci_iprometb<-d_ci_iprometb[((length(d_ci_iprometb)/2)+1):length(d_ci_iprometb)]

>dd_ci_indeks_i_na_pb<-d_ci_indeks_i_na_pb[1:(length(d_ci_indeks_i_na_pb)/2)]*100
>gd_ci_indeks_i_na_pb<-d_ci_indeks_i_na_pb[((length(d_ci_indeks_i_na_pb)/2)+1):length(d_ci_indeks_i_na_pb)]*100

>trg10_intpov_eu<-data.frame(d_indeks_i_na_p,dd_ci_indeks_i_na_p,gd_ci_indeks_i_na_p,d_ppromet,dd_ci_ppromet,gd_ci_ppromet,d_ipromet,dd_ci_ipromet,gd_ci_ipromet,d_indeks_i_na_pb,dd_ci_indeks_i_na_pb,gd_ci_indeks_i_na_pb,d_pprometb,dd_ci_pprometb,gd_ci_pprometb,d_iprometb,dd_ci_iprometb,gd_ci_iprometb)
>writeWorksheetToFile(pomoc,trg10_intpov_eu,sheet='trg10_intpov_eu')

References

1. Thomas Lumley. Complex Surveys – A Guide to Analysis Using R. John Wiley & Sons, New York, 2010.

2. Olga Melovski Trpinac. Monthly Retail Trade Survey – Working paper. SORS, 2013.

3. Thomas Lumley. Package – ‘survey’, available at http://cran.r-project.org/web/packages/survey/survey.pdf, 2013.

4. Thomas Lumley. Package – ‘XLConnect’, available at http://cran.r-project.org/web/packages/XLConnect/XLConnect.pdf, 2014.

5. Michael J. Crawley. The R Book. John Wiley & Sons Ltd, England, 2007.


Development and Current Practice in Using R at Statistics Austria Matthias TEMPL Statistics Austria, Vienna University of Technology Alexander KOWARIK Bernhard MEINDL Statistics Austria

Abstract: The popularity of R is increasing in national statistical offices, not only for simulation tasks. Nowadays R is also used in the production process. A lot of new features for various tasks in official statistics have been developed over the last years and these features are freely available in the form of add-on packages. In this contribution we first give an outline of the use of R at Statistics Austria. Discussed are the necessary infrastructure according to the R installation, the teaching of employees and the support provided to the staff who use R in their daily work. In the second part, the R developments from the methods unit at Statistics Austria are summarised. The developed packages include methods for data pre-processing (e.g. imputation) up to packages for the final dissemination of data, including packages for statistical disclosure control, estimation of indicators and the visualisation of results.

Keywords: Official Statistics, Computational Statistics, R

JEL Classification: C630 Computational Techniques; Simulation Modeling

1 R Software Features

R is a free and open-source environment for statistical computing and graphics. It includes a well-structured, function- and object-oriented programming language. Nowadays, R is already the state-of-the-art software for statistical computing in academia, but it also gains importance in statistical offices as well as in private enterprises.


R is termed an environment because it features, besides well-developed functionality for data manipulation, operators for calculations with vectors, matrices and arrays, and tools for data analysis and graphics. Additionally, the interaction with other well-established software packages is a major strength of R:

• interfaces to other programming languages such as C, C++, Java or Python

• excellent import/export tools for data exchange in csv, Excel, SDMX, XML, Stata, SPSS, SAS (Xport, sas7bdat), JSON, fixed width format and binary formats

• functions that allow connections to important databases, e.g. DB2 (ODBC, JDBC), MySQL, PostgreSQL, Oracle

In R, functions, classes and methods can be defined and created by users; this provides much more freedom and flexibility than, for example, a macro language. It should be mentioned that users have access to the same tools as developers. This is one of the reasons why there are almost 6000 add-on packages ready to be downloaded from the Comprehensive R Archive Network (CRAN).
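A small hedged illustration (not taken from Statistics Austria's code base) of this flexibility: an ordinary user can define a function, attach a class to its result and write a print method for it, exactly as package developers do.

# User-defined function returning an object with an S3 class, plus a print method.
turnover_index <- function(current, previous) {
  structure(list(value = 100 * sum(current) / sum(previous)),
            class = "turnover_index")
}
print.turnover_index <- function(x, ...) {
  cat(sprintf("Turnover index: %.2f\n", x$value))
  invisible(x)
}
print(turnover_index(c(13, 14, 10), c(12, 15, 9)))   # dispatches on the class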

2 Policy of Using R in Statistics Austria

R is an open source project and support is - just as is the tradition with other open source projects - given by the community. Since the R community is large and very strong, chances are high that the community detects possible bugs in packages in due time, and - in contrast to most commercial software developers - bugs are fixed soon. Without doubt, R is one of the most used software packages in academia for teaching students in statistics. Thus, many former students that start working in official statistics or in companies are familiar with R and want to use it when they start to work in statistical agencies, see also van der Loo [2012], Gentleman [2009].

2.1 Infrastructure at Statistics Austria

R is available for almost all operating systems including the current and last versions of Windows, OS X and all popular Linux distributions. At Statistics Austria, R is currently installed on more than 60 computers (Windows 7 platform) and on powerful virtual servers featuring a PowerPC architecture and SUSE Linux Enterprise Server. The server solution is mainly used for tasks that involve large memory requirements or use multiple cores in parallel. The leading R-Team at Statistics Austria consists of three experts from the methods division. In addition, each department has chosen one person as the first contact person for questions and problems that can easily be answered. Furthermore, the following organizational setup is in place:

• the R-experts at the methods unit (the administrators) take care of the version of R, they decide on the GUI front-end and the packages installed with the default installation. All necessary information and


files (R, Rstudio1, packages, documentation and examples) are placed on a particular server.

• the IT department takes these files and deploys the R installation including the front-end (RStudio) to users. This ensures that only one standardised software package is installed on all computers;

• the general R support is centralised through a mailing list (apart from direct questions that can be answered by first-contact persons);

• an internal wiki was created and is used to collect information in a knowledge infrastructure;

• the administrators define access rights for users on the servers, the mailing list, wiki, file depot, etc. At the moment, basically two user groups exist:

1. R administrators: they have read and write access and are responsible for the folder that contains the entire software package, documentation, wiki, etc. Additionally, administrators have full access to the mailing list "R-Support".

2. R-Team: team members have read-only access to the R documentation, read and write access to the wiki, and are members of an R-User mailing list.

3 Education

There are currently two courses about R offered to employees at Statistics Austria.

3.1 Basic Course

The basic course is scheduled for 15 hours (5 x 3h) with the aim to bring all participants to a certain level of knowledge. The target group not only consists of beginners, but also of regular R users who learned R in self-study. For the latter group, many fundamental insights into the software are presented which are mostly new even for experienced users. The course consists of the topics data types, import/export (including database connections), syntax, data manipulation (including the presentation of important add-on packages such as plyr and data.table) and basic object-orientation features. Ex-cathedra teaching is followed by exercises for the students and R sessions in which the trainers interactively give additional insights.

3.2 Advanced Course

The advanced training also consists of 15 hours, with the aim to teach some more complex topics about R.

1 http://www.rstudio.org


The course consists of the topics graphics (graphics, grid, lattice, ggplot2), classes and object-orientation (S3 plus a brief introduction to S4 classes), dynamic reporting (Sweave, knitr, brew, markup), R development issues (profiling, debugging, benchmarking, basic packaging), web applications (shiny) and gives an overview of other useful packages for key tasks in official statistics. Again, ex-cathedra teaching is followed by exercises and interactive sessions.

3.3 Usage of R in Methodological Courses

At Statistics Austria, methodological training is offered to the staff. In these courses, R is intensively used for teaching purposes in a way that participants do not get in direct contact with R. The reason is that for these courses no requirements with respect to any software or programming skills should be necessary. Thus, a blended learning system was developed. At the beginning of a course, participants fill out an online questionnaire. The collected data is then automatically used in the exercises and is also incorporated into the presentation slides. Participants are able to identify their data in various graphics, tables and other output. After approximately 20 minutes of lecturing, participants have to do exercises using point and click directly in the browser. An online server-client based tool was developed which includes (among others) single- and multiple-choice questions, and animated and interactive examples. All clicks and answers from the participants are automatically and anonymously saved on the server, and aggregated statistics (feedback) are generated automatically with the aim that both the trainers and the participants get an overview of whether the examples were correctly solved.

Figure 3.1 Main view of the blended learning system TGUIonline.

Figure 3.1 shows the start screen of the developed teaching system, in which two different views are implemented. In the teacher interface, trainers can activate certain examples which are then available to the course participants.


They also get feedback on how many participants have already solved the question and on the correctness of the solutions. In the student interface, the currently available exercises are listed and can be started.

4 Packages for Official Statistics

4.1 R Task View on Official Statistics

The CRAN Task View on Official Statistics and Survey Methodology lists and briefly describes relevant packages that can be used for important tasks in official statistics. The following topics are considered:

• complex survey design;

• editing and visual inspection of microdata;

• imputation;

• statistical disclosure control;

• seasonal adjustment;

• statistical record matching;

• small area estimation;

• indices and indicators.

We refer to the CRAN Task View for further information: http://cran.r-project.org/web/views/OfficialStatistics.html

4.2 Packages Developed by the Methods Unit at Statistics Austria

The methods unit at Statistics Austria has been using R since 2004. Until recently, SAS was the only software allowed in the statistical production process, but nowadays R starts to replace SAS in many tasks, especially those requiring modern and emerging methods. The aim is to implement new methods in R and to provide the developed packages to various projects, with the goal to increase the usage of R and decrease the dependence on SAS, so that project teams keep on using R for various projects and write code for new projects in R only. The following packages have been developed by the methods unit (partially together with other organisations):

• sdcMicro [Templ et al., 2014], sdcMicroGUI [Kowarik et al., 2013b], sdcTable [Meindl, 2013]: packages for statistical disclosure control;

• TGUI: blended-learning software for teaching;

• VIM [Templ et al., 2012]: visualisation and imputation of missing values;

• x12 [Kowarik and Meraner, 2014a], x12GUI [Kowarik and Meraner, 2014b]: batch processing and interactive visualisation of X12-ARIMA;

• sparkTable [Kowarik et al., 2013a]: sparklines in R for tables with graphics for LaTeX and websites;


> gini("eqIncome", weights = "rb050", breakdown = "db040",
>   data = eusilc)

Value:
[1] 26.48962

Value by stratum:
        stratum    value
1    Burgenland 32.05489
2     Carinthia 25.49448
3 Lower Austria 25.93737
4      Salzburg 25.01652
5        Styria 23.71190
6         Tyrol 25.24881
7 Upper Austria 25.49202
8        Vienna 28.94944
9    Vorarlberg 28.74120

Listing 1 Example from package laeken. Estimation of the Gini coefficient with breakdown on regions. More advanced features like robust estimation and variance estimation are included in the package but not shown here.

• laeken [Alfons and Templ, 2013]: point and variance estimation of poverty indicators;

• robCompositions [Templ et al., 2011a]: statistical methods for compositional data.

A short overview of selected packages is now given.

4.2.1 R Package laeken

Units sampled from finite populations typically feature unequal sampling weights; this has to be taken into account when indicators have to be estimated. Additionally, many indicators are non-robust and suffer from the strong influence of outliers, which are present in virtually all real-world data sets. The R package laeken [Alfons and Templ, 2013] is an object-oriented methodological and computational framework for the estimation of indicators from complex survey samples via standard or robust methods. It provides a class structure to allow for easy handling of the functions and objects. Some widely used social exclusion and poverty indicators are implemented together with a calibrated bootstrap framework to estimate the variance of indicators for common survey designs. An example is shown in Listing 1, in which the Gini coefficient is estimated for some regions. The application of more advanced methods (robustification, variance estimation and plots) is shown in Alfons and Templ [2013], Alfons et al. [2013].

4.2.2 R Package sparkTable

Package sparkTable [Kowarik et al., 2013a] provides additional insights into text and tables by the use of small graphics (sparks) in text and graphical tables. Using sparkTable, sparklines (time series), boxplots and bar charts can be produced. Fine-tuning is possible


> sdc <- primarySuppression(sdc, type = "freq", maxN = 10)
> resHYPER <- protectTable(sdc, method = "HYPERCUBE")

Listing 2 Example code from sdcTable. Primary and secondary cell suppression of an object of class sdcProblem using the hypercube method.

by highlighting specific values, changing colours or by including statistics in the graphics (e.g. the interquartile range). Figure 4.1 shows the use of sparklines in a graphical table.

Figure 4.1 Graphical table produced by sparkTable showing monthly production indices from 2005 till 2010.

4.2.3 R Package sdcTable

The sdcTable package [Meindl, 2013] provides methods to generate instances of multidimensional, hierarchical table structures, identify primary sensitive table cells within such objects, and finally protect primary sensitive table cells by solving the secondary cell suppression problem with currently three implemented algorithms. First, an object of class sdcProblem needs to be created. In this step, all possible hierarchies have to be defined and specifics of the table must be listed. After the creation of such an object, the application of primary and secondary cell suppression methods is straightforward (see Listing 2, in which primary and secondary cell suppression is applied to the object 'sdc' that contains the table that was specified by the user).

4.2.4 R Packages sdcMicro + sdcMicroGUI

The R package sdcMicro [Templ et al., 2014] serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets. All popular disclosure risk and perturbation methods are included. Furthermore, frequency counts, individual


> require("sdcMicro"); data("testdata")
> sdc <- createSdcObj(testdata,
>   keyVars = c('urbrur', 'water', 'sex', 'age'),
>   numVars = c('expend', 'income', 'savings'),
>   pramVars = c("walls"), w = 'sampling_weight', hhId = 'ori_hid')
> print(sdc, "risk")

--------------------------
0 obs. with higher risk than the main part
Expected no. of re-identifications:
24.78 [0.54 %]
--------------------------
Hierarchical risk
--------------------------
Expected no. of re-identifications:
117.2 [2.56 %]

> sdc <- localSuppression(sdc)
> sdc <- microaggregation(sdc)

Listing 3 Example from package sdcMicro. Creating an object of class sdcMicroObj and applying local suppression to achieve k-anonymity, and microaggregation.

and global risk measures, information loss and data utility statistics are updated (re-calculated) automatically after each anonymization step. All methods are highly optimized in terms of computational speed. It is possible to work with large data sets like large survey data from India. Reporting facilities that summarize the anonymization process can easily be used by subject matter specialists and also help to keep the process reproducible. In Listing 3, the package is shown in action. An object of class sdcMicroObj is created first, then the risk is printed, and local suppression and microaggregation are applied to this object. Automatically, the risk and utility measures are updated and all this information is saved in the object. The sdcMicroGUI package [Kowarik et al., 2013b] is especially useful for users with limited knowledge of R, but R experts may also use its facilities for recoding variables.

4.2.5 R Packages VIM and VIMGUI

The package VIM [Templ et al., 2012] contains visualization techniques to explore the structure of non-complete data sets. Thus, it is not just possible to analyse the structure and relations of missing and non-missing data parts, but also to analyse imputed data. The visualization techniques for missing values are described in detail in [Templ et al., 2012]. In addition, various kinds of imputation methods are included in VIM. The choice of available methods is quite extensive and ranges from old-fashioned methods like hot-deck imputation to quite sophisticated methods such as iterative step-wise robust regression imputation [Templ et al., 2011b]. The package can also deal with survey objects from the R package survey. The package VIMGUI is a point and click graphical user interface based on VIM. A screenshot of a simple plot is shown in Figure 4.2.


> data(testdata); x <- testdata$wna
> imp <- irmi(x, mixed = c("m1", "m2"))

Listing 4 Example from package VIM applying model-based iterative robust imputation on data including missing values. Semi-continuous variables have to be specified (parameter mixed).

Figure 4.2 Screenshot from the VIMGUI package. Simple aggregation statistics. In the left plot, the number of missing values for each variable is shown, while on the right the pattern structure of missing values is displayed.

> s <- new("x12Single", ts = AirPassengers)
> s <- setP(s, list(arima.model = c(2,1,1), arima.smodel = c(2,1,1)))
> result <- x12(s)

Listing 5 Simple example from package x12. An x12 object is created, some parameters are set and finally x12 is called with these parameter settings.


4.2.6 R Packages x12 and x12GUI

Different components (mainly: seasonal component, trend component, outlier component and irregular component) of a monthly or quarterly time series can be extracted, and a moving holiday effect, a trading day effect and user-defined regressors can be estimated. The computational basis is the X-12-ARIMA seasonal adjustment software of the U.S. Census Bureau. The x12 package [Kowarik and Meraner, 2014a] calls and extracts the output from X-12-ARIMA and prepares the resulting output for further processing. The package serves as an abstraction layer for batch processing with X-12-ARIMA. New facilities for marking outliers, batch processing and change tracking make the package a powerful and functional tool. In Listing 5 a simplified example is shown. A single time series is chosen, parameters are specified and the object is finally evaluated. For the resulting object, print, summary and various plot methods are available. With the x12GUI package [Kowarik and Meraner, 2014b] users can interactively select additive outliers, level shifts and temporary changes, and the impact is visible immediately. Figure 4.3 shows one view of x12GUI.

Figure 4.3 View of one window of the x12GUI package for seasonal adjustment.


5 Conclusion

Nowadays, most new employees with an academic background in statistics already have skills in R and are highly motivated to continue using R. By creating an infrastructure for R including training and support, the usage of R in the statistical production process is feasible and often preferable to other software solutions. The use of specific R packages related to methods from official statistics makes it possible to tackle problems that are not (easily) solvable with other statistical software packages. This includes survey sampling, calibration, editing, imputation, disclosure control as well as estimation and visualisation. For some of these packages we provided brief information and we showed simplified examples. The aim was that interested readers become aware of the power of R in official statistics. New collaborations between countries seem possible, since everybody can use the packages for free. Intellectual rights, however, should be respected.

References

A. Alfons and M. Templ. Estimation of social exclusion indicators from complex surveys: The R package laeken. Journal of Statistical Software, 54(15):1-25, 2013.

A. Alfons, M. Templ, and P. Filzmoser. Robust estimation of economic indicators from survey samples based on Pareto tail modelling. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(2):271-286, 2013. ISSN 1467-9876. doi: 10.1111/j.1467-9876.2012.01063.x. URL http://dx.doi.org/10.1111/j.1467-9876.2012.01063.x.

R. Gentleman. Data analysts captivated by R's power, 2009. URL http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all&_r=0.

A. Kowarik and A. Meraner. x12: x12 - wrapper function and structure for batch processing, 2014a. URL http://CRAN.R-project.org/package=x12. R package version 1.5.0.

A. Kowarik and A. Meraner. x12GUI: X12 - Graphical User Interface, 2014b. URL http://CRAN.R-project.org/package=x12GUI. R package version 0.12.0.

A. Kowarik, B. Meindl, and M. Templ. sparkTable: Sparklines and graphical tables for tex and html, 2013a. R package version 0.9.7.

A. Kowarik, M. Templ, B. Meindl, and B. Fonteneau. sdcMicroGUI: Graphical user interface for package sdcMicro, 2013b. URL https://github.com/alexkowa/sdcMicroGUI. R package version 1.1.2.

B. Meindl. sdcTable: Methods for statistical disclosure control in tabular data, 2013. URL http://CRAN.R-project.org/package=sdcTable. R package version 0.10.3.


M. Templ, K. Hron, and P. Filzmoser. robCompositions: An R-package for Robust Statistical Analysis of Compositional Data, pages 341-355. John Wiley & Sons, Ltd, 2011a. ISBN 9781119976462. doi: 10.1002/9781119976462.ch25. URL http://dx.doi.org/10.1002/9781119976462.ch25.

M. Templ, A. Kowarik, and P. Filzmoser. Iterative stepwise regression imputation using standard and robust methods. Comput Stat Data Anal, 55(10):2793-2806, 2011b.

M. Templ, A. Alfons, and P. Filzmoser. Exploring incomplete data using visualization techniques. Advances in Data Analysis and Classification, 6(1):29-47, 2012. doi: 10.1007/s11634-011-0102-y.

M. Templ, A. Kowarik, and B. Meindl. sdcMicro: Statistical Disclosure Control methods for anonymization of microdata and risk estimation, 2014. URL https://github.com/alexkowa/sdcMicro. R package version 4.2.0.

M. van der Loo. The introduction and use of R software at Statistics Netherlands. In Proceedings of the Third International Conference of Establishment Surveys (CD-ROM), Montreal, Canada, 2012. American Statistical Association. URL http://www.amstat.org/meetings/ices/2012/papers/302187.pdf.


Using R as an alternative teaching tool in the Ecological University of Bucharest Carmen UNGUREANU ([email protected]) Ecological University of Bucharest

ABSTRACT

In a global world, universities want to offer the best education to their students so that they can be competitive on the labour market, both in the country where they studied and beyond its borders. The Romanian education system - currently undergoing reform - attaches great importance to the use of traditional efficient teaching tools, along with new alternative ones. The R data analysis system represents such an alternative method, which the Ecological University of Bucharest uses in order to stimulate the students' creativity in problem solving.
Keywords: university, economic education, academic tools, teaching, open source R
JEL Classification: A22, A23, I21, I25

1. TERTIARY EDUCATION IN ROMANIA. GENERAL DATA

Radical improvement and diversification of the educational offer of the entire system of education and training in Romania is recognized as a priority target of strategic importance and a mandatory condition for putting into practice the principles of sustainable development in the medium and long term. "In the Romanian society there is a wide recognition that the education represents the strategic factor for the future development of the country through its essential contribution to multidimensional modeling and predictive human capital"1. In accordance with the current specific legislation, the Romanian educational system includes:

1 The Regional Development Strategy 2014-2020, Priority Axis 6. Development of human capital, social inclusion growth


- secondary education and
- tertiary education.

According to article 23 of the Law of National Education (Law No. 1 of January 5, 2011, published in the Official Monitor No. 18 of 10 January 2011), the national system of secondary education includes the following levels:
a) early education (0-6 years), consisting of the ante-preschool level (0-3 years) and the preschool level (3-6 years);
b) primary education, which includes the preparatory class and classes I-IV;
c) secondary education, including:
- lower secondary education or gymnasium, which includes grades V-IX;
- higher secondary education or high school level, including the high school grades X-XII/XIII;
d) professional education, with a duration between six months and two years;
e) non-university tertiary education, including post-secondary education.
General compulsory education is composed of primary and lower secondary education.

Article 114 of the Education Law shows that tertiary education is organized in universities, academies, institutes and schools of higher education. Tertiary education institutions may be state, private or religious. These institutions have legal personality, are non-profit, of public interest and apolitical.


The evolution of the number of these institutions in the period 1992-2012 is presented in the following graphic:

Fig. 1. Number of tertiary institutions

Source: Romanian National Institute of Statistics

It should be noted that, although private universities have functioned since 1990 (the Ecological University of Bucharest has functioned since April 1990), they were not recorded in official statistics until 1994, because of the legislative void in this field during that period. The graphic shows that the number of state tertiary education institutions (56 units in 2012) is now close to that of private institutions (51 units in 2012), which indicates that this form of private tertiary education is also needed in Romania. Regarding the number of students enrolled in tertiary education in the period 1990-2012, there was an increase of 4.7 times in 2007 compared to 1990, followed by a decrease to about half that number in 2012 (464,592 students enrolled in tertiary education in 2012, compared to 907,353 in 2007). Private tertiary education followed almost the same trend: in 2008 the number of students was 3.7 times higher than in 1997, but it then fell drastically, by 75.7% over 2008-2012, so that in 2012 the number of students almost returned to the level registered in 1997.


The situation analysed is presented in Fig. 2.

Fig. 2. Number of students enrolled in tertiary education, 1990-2012

Source: Romanian National Institute of Statistics

The number of graduates of tertiary education followed the same increasing trend in 1990-2007 as the number of students enrolled, after which, in 2007-2011, the number of those who graduated from a tertiary education institution decreased by 41.3%. In private tertiary education the number of graduates decreased by 54.8% in 2011 compared to 2009, which means that not all enrolled students completed their studies: some lacked the financial resources needed for fees, some found jobs that did not require a degree, some opened their own business without a degree, and others left the country for earnings higher than they could obtain at home. The situation is presented in the graphic below:


Fig. 3. Number of graduates in tertiary education, 1990-2011

Source: Romanian National Institute of Statistics

The analysis of the number of teaching staff in tertiary education in Romania shows that in the period 1992-2007 their number almost doubled (the increase in 2007 was 1.76 times compared to the reference year 1992). From 2007 to 2012 the number of teaching staff decreased by 14%, due both to the reduction in the number of students in this period and to teachers reaching the retirement age stipulated in the Education Law no. 1/2011. The number of teachers in private education doubled between 1995 and 2007 (an increase of 1.88 times over the analysed period), while by 2012 their number had decreased by 24% compared with 2007, again because of the falling number of students in 2007-2012. This trend is outlined in Fig. 4.


Fig. 4. Teaching staff, 1992-2012

Source: Romanian National Institute of Statistics

Most teachers in Romanian universities are looking for new methods and tools for teaching courses and seminars, in order to meet the needs of today's students, who increasingly use new technologies. It should be noted that in pursuing this goal teachers are sometimes confronted with resistance from some students or colleagues. The use of R in several Romanian universities reflects the teachers' desire to provide quality educational services that support students' personal development, their professional insertion and the need for socio-economic competence. The universities in Romania that currently use R are listed below:

- University of Bucharest: Faculty of Sociology and Social Assistance (which uses the R statusor package for teaching statistics, with applications in R); Faculty of Mathematics;
- University of Piteşti: Faculty of Mathematics and Computer Science;
- Technical University of Civil Engineering Bucharest: Faculty of Civil, Industrial and Agricultural Buildings;


- Academy of Economic Studies of Bucharest: Faculty of Cybernetics, Statistics and Informatics;
- Ecological University of Bucharest: Faculty of Economic Sciences.

The Ecological University of Bucharest has sought, since its establishment in 1990, to educate and train students for their personal development, social integration and active participation in the functioning and development of a sustainable economy.
Ecological University of Bucharest – the first private educational institution in Romania after 1990
Founded on 4 April 1990, the Ecological University of Bucharest is the only university in Romania with an environmental profile. It numbers today about 30,000 graduates of the following faculties: Faculty of Ecology and Environmental Protection; Faculty of Law and Administrative Sciences; Faculty of Economic Sciences; Faculty of Physical Education and Sports; Faculty of Communication Sciences; Faculty of Management and Environmental Engineering; Faculty of Psychology. The faculties offer 11 undergraduate programs, 21 master programs and 25 postgraduate training and continuing professional development programs.

“The mission of the Ecological University of Bucharest consists of initial and continuing training of highly qualified specialists for professional activities that are competitive in the labour market, as well as the achievement of efficient research and development activities. The university has also the mission to create, exploit and disseminate knowledge through the development of educational and research methods for all members of the University community, so as to ensure an appropriate position in the Romanian and European higher education.”1

Within the University, at the Faculty of Economic Sciences, R was introduced by several young and enthusiastic teachers as a useful tool with many advantages over traditional software packages.

1. Self-assessment report 2013, www.ueb.ro


Faculty of Economic Sciences – short presentation
Undergraduate studies are organized in two specializations:
- Finance and Banking;
- Business Administration.
There are also 4 master programs and 10 postgraduate training and continuing professional development programs. Out of the 41 disciplines of each undergraduate program and the 14 disciplines of each master program, R is currently used in the following: statistics; economic statistics; financial macroeconomics; financial forecasting; econometrics; financial econometrics; capital markets; capital market – institutions and tools.
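To give a sense of how R appears in one of these disciplines (econometrics, for example), a minimal sketch with simulated data is shown below; the variable names, the coefficients and the data are purely illustrative and are not taken from the actual course materials.

#illustrative sketch (simulated data): a simple regression of the kind used in an econometrics seminar
set.seed(123)
x <- rnorm(100)                          #explanatory variable
y <- 2 + 0.8 * x + rnorm(100, sd = 0.5)  #dependent variable generated from x plus noise
model <- lm(y ~ x)                       #ordinary least squares fit
summary(model)                           #coefficients, R-squared, significance tests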

Applications of R in courses:
- graphical representation of distributions and time series (a minimal illustrative sketch follows below);
- maps;
- statistical indicators (see Annex).
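The following is a minimal base R sketch of the first type of application; the series is simulated here, so the data, the object name monthly.values and the labels are purely illustrative and do not come from the course material.

#illustrative sketch only: plotting a simulated monthly time series in base R
set.seed(1)
monthly.values <- ts(cumsum(rnorm(36, mean = 0.5)), start = c(2011, 1), frequency = 12)
plot(monthly.values, xlab = "Year", ylab = "Value",
     main = "Example of a time series plot")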

Advantages of R software in the University:
• promoting open-source software among students;
• exchange of information and ideas on R between students from universities around the world;
• the possibility for professors and students to disseminate research results;
• identifying opportunities for collaboration and joint projects between different universities;
• direct involvement of students in the preparation of program subroutines;
• improved relationships and communication between students and teachers outside the study programs;
• harmonization of the disciplines in the faculties' curricula with those of partner universities;
• building skills that make students competitive on the labour market;
• no cost for academic use;
• the existence of two of the faculty's professors who are part of Team R-omania (associate professor Nicoleta Caragea and associate professor Antoniade Ciprian Alexandru), who can supervise the work of teachers and students in this field and provide additional help;
• wider communication and more contacts between universities;
• promotion of the University and the Faculty at national and international level.


Disadvantages of R software in the University:
• being a relatively new open-source program, its popularity has had a slow start;
• at this moment R is used on a small scale, because it is not yet promoted at the highest level in all faculties of the Ecological University of Bucharest;
• R can currently be used in only a few disciplines.

References:

1. Miroiu Maria, Petrehuş Viorel, Zbăganu Gheorghiţă, Iniţiere în R pentru persoane cu pregătire matematică, course developed within the project POSDRU/56/1.2/S/32768, “Formarea cadrelor didactice universitare şi a studenţilor în domeniul utilizării unor instrumente moderne de predare-învăţare-evaluare pentru disciplinele matematice, în vederea creării de competenţe performante şi practice pentru piaţa muncii”

2. Wiley, David, 2006. Open source, openness, and higher education. Innovate 3 (1), http://www.innovateonline.info/index.php?view=article&id=354

3. The Regional Development Strategy 2014-2020, Priority Axis 6. Development of human capital, social inclusion growth

4. Law of National Education (Law No. 1 of 5 January 2011, published in the Official Monitor No. 18 of 10 January 2011)

5. EUB Self-assessment report 2013, www.ueb.ro

6. National Institute of Statistics, National Center for Statistical Training, course “Introduction to estimation techniques on small fields with applications in R / Introduction to SPSS”, 18-22 November 2013. Lecturers: Nicoleta Caragea, Ciprian Alexandru, Ana Maria Dobre, experts in R

7. R for Social Network Analysis, http://www.stanford.edu/~messing/RforSNA.html

8. Institutul Naţional de Statistică, www.insse.ro


Annex

STATISTICAL INDICATORS – STATISTICAL LESSON WITH R

Let “Graduates” be a data set containing 80 graduates, their specialization and their initial salary. We want to represent the distribution of the initial salary and to compute some indicators, such as the mean and the standard deviation.
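For readers who wish to reproduce the steps on their own data, the table could be loaded, for example, from a CSV file; the file name used below is only an assumption, and the outputs shown further on refer to the original 80-record data set used in the lesson.

#one possible way to load the data (the file name "Graduates.csv" is only an assumption)
Graduates <- read.csv("Graduates.csv", header = TRUE, fileEncoding = "UTF-8")
str(Graduates)   #check that Graduate, Specialization and Initial.salary were read correctly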

> head(Graduates)
  Graduate Specialization Initial.salary
1        1        Finanţe           1550
2        2     Management           1310
3        3     Management           1575
4        4      Marketing           1675
5        5  Contabilitate           1585
6        6      Marketing           1590

> attach(Graduates)

#graphical representation of Initial salary distribution

> hist(Initial.salary)
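An optional refinement of this step, not part of the original lesson, is a histogram with explicit labels and a title; the texts used below are only suggestions.

#optional refinement (not in the original lesson): a labelled histogram
hist(Initial.salary,
     xlab = "Initial salary", ylab = "Number of graduates",
     main = "Distribution of the initial salary", col = "grey")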

#computing the mean of Initial salary
> mean(Initial.salary)
[1] 1626.688


#computing the standard deviation of Initial salary
> sd(Initial.salary)
[1] 181.2508

#computing the median of Initial salary
> median(Initial.salary)
[1] 1610

#computing the geometric average of Initial salary
> exp(mean(log(Initial.salary)))
[1] 1616.851
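A possible extension of the lesson, not included in the original material, is to collect these indicators in a small helper function; the name salary_indicators is introduced here only for illustration.

#illustrative extension: computing all the indicators in one call
salary_indicators <- function(x) {
  c(mean = mean(x),
    st.dev = sd(x),
    median = median(x),
    geometric.mean = exp(mean(log(x))))
}
salary_indicators(Initial.salary)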