Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs...

10
Data Science Approach to Analysis of Lattes CV Data Thiago Lu´ ıs Viana de Santana Applied Computing Graduate Program INPE – Nat. Inst. for Space Research [email protected] Rafael Santos Applied Computing Graduate Program INPE – Nat. Inst. for Space Research [email protected] Abstract The Lattes Platform is an online database of academical records. It is used by the research and educational community of Brazil (and some other countries), be- ing of great value for identification of researchers and their relationships with other researchers and, for that, it can be considered a specialized kind of social net- work. In spite of its usefulness, the main inter- face of access to its data does not allow any type of analysis, just basic reports. In this paper, we present a new tool and approach to analysis of groups of records from the Lattes Platform which is simpler and more flexible than other similar tools proposed in the past. 1 Introduction Evaluation of researchers and students productiv- ity, either individually or in groups, is an important task for universities, research centers, and funding agencies, and also for the students and researchers themselves: students may be interested in know- ing more about the achievements, research areas and experiences of prospective advisors. Teachers and admission deans may also want to know about the academic history of candidates to a graduate program, for example. The usual method to eval- uate academics and students is by analysis of the achievements and publications listed on his or hers curriculum vitae (CV). The Brazilian National Council of Techno- logical and Scientific Development (Conselho Nacional de Desenvolvimento Cient´ ıfico e Tec- nol´ ogico, CNPq) maintains an online system, the Lattes Platform 1 (named in honor of C´ esar Lattes, 1 http://lattes.cnpq.br a Brazilian physicist) that provides an unified in- terface to a database that is used to collect, store and process information about academic achieve- ments. Any researcher can create and maintain his or hers own Lattes CV, using the taxonomy, for- mats and fields defined by the platform. The unifi- cation of some fields and categories makes it easy to fill the forms that feed the database. The Lattes Platform is also used by CNPq to generate reports about the current status of the aca- demic production of the researchers and students, and to evaluate applications to several different types of grants. The data on the platform is also used by other government funding agencies and by the Ministry of Education, for evaluation of the production of professors and students in graduate programs. The Lattes Platform public interface is a web- based system that allows the edition of the CVs by its owner and the search and retrieval of the CVs by anyone that knows either the researchers’ names or IDs (a 16-digit unique identifier). In the end of 2016 there were more than 3.500.000 CVs stored in the database 2 . Of those, almost 1.500.000 were CVs of students. Data in the Lattes Platform can also be used for other academic purposes: analysis of academic in- dicators’ evolution (Perez-Cervantes et al., 2012), identification of communities based on similar interests or collaborations (Mena-Chalco et al., 2014; Ara´ ujo et al., 2014; Alves et al., 2011b), changes of research areas based on the publica- tions records, etc. Analysis considering groups of researchers or students can be done in differ- ent scales, from the whole academic community to small groups, such as researchers in a group or students in a college. But although the Lattes CV data is considered public (being provided by the 2 http://estatico.cnpq.br/painelLattes/ 168

Transcript of Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs...

Page 1: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

Data Science Approach to Analysis of Lattes CV Data

Thiago Luıs Viana de SantanaApplied Computing Graduate ProgramINPE – Nat. Inst. for Space [email protected]

Rafael SantosApplied Computing Graduate ProgramINPE – Nat. Inst. for Space Research

[email protected]

AbstractThe Lattes Platform is an online databaseof academical records. It is used bythe research and educational communityof Brazil (and some other countries), be-ing of great value for identification ofresearchers and their relationships withother researchers and, for that, it can beconsidered a specialized kind of social net-work.

In spite of its usefulness, the main inter-face of access to its data does not allow anytype of analysis, just basic reports. In thispaper, we present a new tool and approachto analysis of groups of records from theLattes Platform which is simpler and moreflexible than other similar tools proposedin the past.

1 IntroductionEvaluation of researchers and students productiv-ity, either individually or in groups, is an importanttask for universities, research centers, and fundingagencies, and also for the students and researchersthemselves: students may be interested in know-ing more about the achievements, research areasand experiences of prospective advisors. Teachersand admission deans may also want to know aboutthe academic history of candidates to a graduateprogram, for example. The usual method to eval-uate academics and students is by analysis of theachievements and publications listed on his or herscurriculum vitae (CV).

The Brazilian National Council of Techno-logical and Scientific Development (ConselhoNacional de Desenvolvimento Cientıfico e Tec-nologico, CNPq) maintains an online system, theLattes Platform1 (named in honor of Cesar Lattes,

1http://lattes.cnpq.br

a Brazilian physicist) that provides an unified in-terface to a database that is used to collect, storeand process information about academic achieve-ments. Any researcher can create and maintain hisor hers own Lattes CV, using the taxonomy, for-mats and fields defined by the platform. The unifi-cation of some fields and categories makes it easyto fill the forms that feed the database.

The Lattes Platform is also used by CNPq togenerate reports about the current status of the aca-demic production of the researchers and students,and to evaluate applications to several differenttypes of grants. The data on the platform is alsoused by other government funding agencies andby the Ministry of Education, for evaluation of theproduction of professors and students in graduateprograms.

The Lattes Platform public interface is a web-based system that allows the edition of the CVsby its owner and the search and retrieval of theCVs by anyone that knows either the researchers’names or IDs (a 16-digit unique identifier). Inthe end of 2016 there were more than 3.500.000CVs stored in the database2. Of those, almost1.500.000 were CVs of students.

Data in the Lattes Platform can also be used forother academic purposes: analysis of academic in-dicators’ evolution (Perez-Cervantes et al., 2012),identification of communities based on similarinterests or collaborations (Mena-Chalco et al.,2014; Araujo et al., 2014; Alves et al., 2011b),changes of research areas based on the publica-tions records, etc. Analysis considering groupsof researchers or students can be done in differ-ent scales, from the whole academic communityto small groups, such as researchers in a group orstudents in a college. But although the Lattes CVdata is considered public (being provided by the

2http://estatico.cnpq.br/painelLattes/

168

Page 2: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

researchers and students themselves), retrieval ofthe data is limited: it is possible to download a fullindividual CV as a XML file with all the data en-tered on that CV, but it is not possible to retrievesubsets of the data for more than one CV at a time.This makes it hard to perform some specific typesof analysis that requires the extraction of certaincategories from several CVs at once.

In this paper we present our work on tools andtechniques that allow the extraction and analysisof data from collections of Lattes CVs. One char-acteristic of these techniques is that it considersthat extraction of data from a set of Lattes CVsas a part of a data science process (Schutt andO’Neil, 2013): raw data (the Lattes CV XMLs) iscollected, processed and cleaned; allowing an an-alyst to use exploratory data analysis techniquesand apply statistical or other models on it. The an-alyst has access to the data and the tools to processit in a simple but flexible environment, thereforehe or she isn’t limited to a packed set of tools. Re-sults of the analysis are presented in charts, plots,reports and data products, derived from the set ofLattes CVs considered on the analysis, and whichmay be used as input to other analysis tasks.

This paper is divided as follows: Section 2presents related work, mainly on other tools usedto extract information from Lattes CVs. Section 3sets requirements for a Lattes CV exploration tooland proposes a data science based approach tobuild this tool. In section 4, the Lattes CV explo-ration tool is used and a few data Exploratory DataAnalysis visualization examples are exhibited. Fi-nally, section 5 enumerates problems outside theExploratory Data Analysis framework that can besolved by extending the LattesLab tool.

2 Related Work

Access to subsets of data on Lattes CVs is desir-able for several different types of analysis, and of-ten these analyses must be done considering col-lections and not individual CVs. Considering thisneed, several different tools were created in thepast to process and analyze data from the LattesPlatform.

One of the first tools that allowed the extractionof information from a collection of Lattes CVs isscriptLattes (Mena-Chalco et al., 2009). This toolallowed the extraction of data from the Lattes CVsof groups of researchers, creating reports, maps,graphs and other information from the collections

of CVs. The tool could be deployed as a local ap-plication on computers running Linux, and its au-thors made the tool open so other groups could useit. Other research groups used scriptLattes as basisto create different types of analyses (Alves et al.,2016; Perez-Cervantes et al., 2012).

Initially scriptLattes downloaded the data fromCNPq’s servers as HTML files, but with lateradoption of a CAPTCHA (”Completely Auto-mated Public Turing test to tell Computers andHumans Apart”) access control system, automaticdownload was made very difficult, so it was mod-ified to use a local set of files that must be down-loaded in advance.

scriptLattes is probably the most referenced toolin the bibliography we surveyed. It is quite com-plete, but the data is parsed from the HTML ver-sion of the Lattes CVs, which has changed in thepast and may change in the future. Reports andgraphics are also preprogrammed, so extensionsand different layouts must be programmed sepa-rately.

LattesExtractor3 is a tool developed by CNPqthat allows the download of several Lattes XMLCVs in batches. Although it seems to solve part ofthe problem at hand, namely, how to obtain sub-sets of the data, it is not as open or as flexible: onlyregistered organizations can retrieve data with thistool, and organizations can only access data that isrelated to the organization itself. For example, anuniversity may be able to download all the XMLfiles with the CVs of its staff, teachers and stu-dents, but will not be able to download CVs ofcollaborators that don’t currently work or study atthe university. Similarly, when a student graduatesand leaves the university, its CV will no longer beavailable after graduation (since he or she is notpart of the university). This tool also does notperform any kind of analysis, providing only theXML files.

SUCUPIRA (Alves et al., 2011b) was de-veloped as a tool that allowed both the semi-automatic extraction of the XML files from theLattes Platform and the creation of reports andgraphics that could answer questions about col-laboration between researchers, their geographicallocation, their scientific production and its evolu-tion, etc. SUCUPIRA is a web-based applicationthat uses a list of names of researchers or students,managed by the system’s user, to download the

3http://lattesextrator.cnpq.br/lattesextrator/

169

Page 3: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

Lattes CVs (as HTML files), parse those and cre-ate reports based on the data extracted from theCVs on that list.

It seems that the development of that tool wasdiscontinued, but changes on the Lattes Platformmay have made it unusable: the structure of theHTML files changed, therefore parsers that couldparse a version of the HTML generated by theplatform had their usefulness restricted due to thenew layout. Additionally, SUCUPIRA was writ-ten when access to the Lattes CVs was unhinderedby CAPTCHAS – for some time the platform usedsimple CAPTCHAS that could be solved withtools such as Tesseract (Kay, 2007), but recentlythe CAPTCHAs were made more difficult to solveautomatically.

SUCUPIRA used another tool developed bythe same researchers: LattesMiner (Alves et al.,2011a). LattesMiner is a Domain-Specific Lan-guage (DSL) implemented as a set of Java classesthat allows the manipulation of data on a set ofLattes CVs, defined by a programmer, with mod-ules for data discovery (association of names andIDs), data extraction (parsing of the HTML filescorresponding to the CVs with regular expres-sions), storage of data in a local database, vi-sualization and analysis tools. As with SUCU-PIRA, LattesMiner’s development has stopped,since changes on the Lattes platform renderedsome aspects of the tool unusable.

Another tool that is concerned with processingdata from the Lattes Platform is XMLattes (Fer-nandes et al., 2011), which converts the LattesCVs in HTML to XML for further processing, andwhich was rendered unnecessary since the presentversion of the Lattes platform already exports theXML version of the curriculum vitae (although re-quiring CAPTCHAs for download of individualCVs).

As part of this research we reviewed several pa-pers related to analysis of Lattes CVs data (Di-giampietri et al., 2012; Mena-Chalco et al., 2014;Araujo et al., 2014; Perez-Cervantes et al., 2012).Most of these papers used a database with de-tailed information on academics, which was ex-tracted from the Lattes Platform when it was pos-sible to do so without the limitations imposed bythe CAPTCHA currently in use. There were noreferences on whether that database was kept upto date.

3 A Data Science-based Approach toAnalysis of Lattes CV Data

Up to now, a number of Lattes CV exploring toolshave been listed. Some of them cannot be used asdesigned for the following reasons:

• The Lattes CV platform has updated its datapublishing technology from HTML to XML.Therefore, all the tools that relied on that pre-vious file distribution (often requiring com-plex procedures to parse HTML) no longerwork properly. It seems a trivial technicalissue, but the tools that extracted informa-tion from the Lattes CVs formatted as HTMLdocuments had to deal with complex HTMLstructures and had to detect, from the HTMLcontent itself, the categories of informationbeing extracted (e.g. articles in journals, con-ferences, titles, authors, etc.) while the datarepresented as XML is properly formattedand tagged with this information. Therefore,even though the XML platform restricted theuse of the HTML-based tools, it providedstructural elements for development of new,more robust tools.

• Some of these tools were designed to au-tomatically download a list of CVs fromCNPq’s site. This is not possible today dueto the implementation of the CAPTCHA test,both to view and download the Lattes CVs.

• Some of the solutions are only available onspecific platforms - such as Linux - and re-quire a specific setup before use.

Considering the present status of the existingtools for Lattes CV analysis we consider that anew, functioning tool to extract, interpret, analyzeand visualize the data in a simple but flexible wayought to comply with the following requirements:

1. Work with an offline set of Lattes CVs.Due to the current impossibility of automatedbatch download of Lattes CVs, the CVs musteither be obtained manually or automaticallythrough other authorized tools – such as Lat-tesExtrator.

2. Be able to transparently transform the list ofLattes CVs’ XML files into table-like datastructures for further processing. To makethis transformation, it is necessary to know

170

Page 4: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

the structure of the Lattes CV XML, identifythe parameters of interest to the researcherand migrate these parameters to the datastructures. This transformation is made eas-ier since CNPq publishes the XML SchemaDictionary (XSD) of the Lattes CV files.

3. Be agnostic with respect to the operating sys-tem used to run the tool. Each potential userhas a limited number of resources and to re-quire the user to learn a new programminglanguage, install a new software or even anew whole operating system shall be avoidedif the tool is to be widely used.

4. Require no specialized knowledge for opera-tion. In the same way that requiring an spe-cific environment limits the utilization of thetool, if specialized knowledge is not requiredto operate this tool, it will be easier to use,and for that, may be used by a bigger num-ber of individuals. At the same time the toolmust be extensible so more advanced userscould do more with the tool.

5. The results produced by this tool should bereproducible by any user interested in anal-ysis of a set of Lattes CVs. Reproducibilityis ensured by the use of a common set of in-structions that can be easily shared and buildupon.

The first three requirements on that list arestrictly technical, and must be met by a Lattesanalysis tool to deal with the complexity of theLattes CVs’ data access and representation issues.More important are the fourth and fifth require-ments, that ensure that such a tool can be extendedfor different ways to explore the data and that theresults can be reproduced and shared.

By designing a tool that follows these require-ments it is possible to apply a data science processto the problem of analysis of collections of LattesCVs. The data science process (Schutt and O’Neil,2013) is shown in Figure 1.

The requirements listed above are directly re-lated to the Data Science Process: the first require-ment makes reference to the collection of data,the second requirement is related to the process-ing and cleaning of data as well as storage of the“clean” data, so that it can be easily and trans-parently accessed. Even though the fourth andfifth requirements are not directly associated to

Figure 1: Data Science Process (Schutt andO’Neil, 2013).

the steps shown in the Data Science process (Fig-ure 1), they serve as a guideline to ensure that theresults attained by the tool are easy to acquire –and customize – and also reproducible.

The fourth requirement listed on this section isdirectly related to the concept of Exploratory DataAnalysis (EDA), which intent is to allow the re-searcher to discover patterns on data by using vi-sualization tools and statistics to understand “whatis going on with this data”.

Analysis of Lattes CVs can be done in differentways, using different metrics and algorithms, butif we consider that most of the analyses will bedone considering thematic groups of researchers(e.g. researchers in a specific area of knowledge,or professors and students of a specific depart-ment) it becomes clear that methods and tech-niques applied to a particular analysis can be usedinto different contexts, depending on the group. Atool for analysis of Lattes CVs collections mustmake easy the reproduction of its results – repro-ducible research is also a concept closely linked toData Science.

According to (Peng, 2011), there is a spectrumof reproducibility of research, that goes from anon-reproducible result to a fully reproducible one(Figure 2).

Figure 2: Reproducibility spectrum of a scientificpublication (Peng, 2011).

Considering Figure 2, it is highly desirable thata tool performing Lattes CV Analysis allows the

171

Page 5: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

full reproduction of the data analysis experimentswith the corresponding code being publishableand applicable to different datasets of the same na-ture.

3.1 LattesLabIn order to address the requirements listed on theprevious section, we propose a software stack so-lution, based on concepts and principles of DataScience, to tackle the generic problem of analyz-ing Lattes CVs. Our tool, named LattesLab, isbased on the following components:

• A library that is able to scan a collection ofLattes CVs (stored as local files) and create aset of data frames (table-like structures) fromthat collection.

• A deployment mechanism for that librarythat allows its use, with minimal software in-stallation requirements.

• A set of live documents that shows how toperform basic statistical analysis, visualiza-tions and reports.

LattesLab is being developed in the Python lan-guage’s (VanRossum and Drake, 2010), widelyused platform – taking into account its large de-veloper and user community. It is also one of themost used programming languages to solve DataScience problems due do the large amount of freeand open analysis and visualization libraries.

Other languages were considered for implemen-tation – a previous version was developed in Java,but we found out that the distribution of the libraryto use in other derived projects was too complex.R (Ihaka and Gentleman, 1996) was also consid-ered: even though the R language is strongly sup-ported by the community and is heavily used bythe scientific community, Python was chosen forthe readability of its code (over R code, at least)and its gradual learning curve.

To use LattesLab, it is necessary to have all theLattes CV files of interest stored in a folder andpass the folder name as a variable to the Lattes-Lab main library. Then, the tool reads the CV filesas downloaded from the Lattes Platform, parsingthe XMLs and storing the data available on dataframes.

To extract data from the Lattes CV, XPath (partof XLST, a language for transforming XML docu-ments) was used. One of the major advantages of

the Lattes CV is its XML structure (which allowsthe extraction of semantic information from it) andthe fact that the way its structure is posed is avail-able in a public XML Schema Definition. By ac-cessing the XML file through the XPath languageone can extract the desired information and use itaccordingly. In the presented case, the extractedinformation is used to generate a data frame whichwill the be used to perform a few basic analysis.

To obtain meaningful results from the analy-sis, it is desirable that the set of Lattes CVs sharesome characteristics. For example, to perform ananalysis of the students and researchers of one in-stitution, it is necessary to have the Lattes CVsof members of that institution stored and subjectto analysis by LattesLab. Other sets of thematiccollections could be researchers of all institutionssharing some CNPq classification (e.g. CNPqgrantees), or researchers that stated that they workin a specific knowledge area.

The LattesLab main library is packed as aPython package, which simplifies its deployment.It can be downloaded and used in a standaloneway, in this case the user must be able to at leastinstall a Python IDE and then install and importthe package.

Another deployment solution which is morestraightforward is to deploy LattesLab as a Jupyternotebook (Kluyver et al., 2016). Jupyter note-books are web applications that allow creation andsharing of documents that contain live code, equa-tions, visualizations and explanatory text. Thenon-static parts of these documents are created bythe execution of Python code.

Notebooks are an interesting solution not onlyto make access to the information easier, but alsodue to the fact that it runs in any operating systemwith a browser installed, and that it works not onlywith Python, but with R, Scala, Julia, and over 40programming languages.

Jupyter notebooks allows not only the deploy-ment of code, but, in the same document, format-ted text (with different styles and allowing the useof hypertext, graphics, etc.) related to that code.Therefore, by using a Jupyter notebook, it is pos-sible to run LattesLab code, to perform analysesand visualizations, to explain what was done andto give instructions to the users, so that he or shedoesn’t have to know the tool beforehand to useit and can change parameters or modify analysisto suit specific needs. Figure 3 shows a Jupyter

172

Page 6: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

notebook running LattesLab.

Figure 3: Example of a Jupyter Notebook windowrunning the LattesLab tool.

When considering how to develop and deploy atool for analysis of Lattes CV data we could optfor a standalone, GUI-based application. The rea-son to work on a tool with interactive lines of codeis due to the flexibility that such a tool provides.Consider a simple analysis that requires the selec-tion of a date range on a set of CVs: a GUI mustprovide a widget to allow the input of an initial andend year, which is quite simple to implement anduse, and a programming approach would requireone or two lines of code to implement the samefunctionality. But for more complex filters, e.g. toselect a non-continuous range of dates, a GUI dia-log would be more complex for the user (probablyimplemented as a list of checkboxes, one for eachyear) than one or two lines of code that filter a dataset by a list of years.

A programming environment, while more com-plex, give more freedom to the user to implementfilters, apply visual effects on graphics and usethird-party tools, but the most important reasonto avoid a GUI-based approach is the easiness ofreproducibility: the chain of commands that pro-vide an analysis from a dataset can be expressed incode, which can be documented and read by users,while GUI-based applications would require ges-tures (clicks, scrolls, inputs) that must be pre-served somehow to allow reproduction of the anal-ysis.

4 Examples of Analysis Reports

For the following examples, a group of 876 Lat-tes CVs was downloaded from the Lattes Platform.These CVs belong to participants in the ScientificInitiation grant program at the Brazilian Instituteof Space Research (INPE) – these grants are givento undergraduate students to participate in researchand development at the institute. The LattesLab li-brary scanned and parsed these CVs’ files to createa data frame used in the analysis and examples inthis section.

Lattes CVs are created and updated by the users,and some may stop updating theirs, specially ifthey are not involved in academic environmentsanymore. A basic but interesting question wecould ask about our data is: how old are the CVs,i.e. what is the last time they were updated?

We used LattesLab to create a histogram of theage of the CVs (the “last updated” information ispresent on the CVs).

Figure 4: Age of Lattes CV in months.

In order to give an idea of the simplicity of usingJupyter notebooks and the LattesLab library, theplot in Figure 4 was created by three lines of code,once the data frame is created and loaded (to im-port data from the Lattes CVs and create the dataframe, approximately two hundred lines of codewere used, but these are not shown to users).

Scientific Initiation grants are given for a periodof 12 months. It is possible for an undergradu-ate student to reapply for a grant, as long as he orshe is enrolled in an undergraduate program. Weknew that some of the students had held more thanone grant, but wanted to get some statistics on it.A simple histogram was created, and it is shown

173

Page 7: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

in Figure 5. Surprisingly, there were students thatheld grants for five and six years – that was unex-pected since the average of the duration of under-graduate technical courses in Brazil is five years.

Figure 5: Number of Scientific Research Grantsper Student.

The plot above was created with five lines ofcode in a Jupyter notebook.

With that thematic set of Lattes CVs data itis possible to analyze the academic achievementsand get a glimpse on the careers of the studentsthat held Scientific Initiation grants. How many ofthose decided on an academic career after their un-dergraduate studies? Figure 6 shows, of the LattesCVs on our data set, how many individuals werepart of different academic activities and achievedwhich academic degrees. An individual can bepart of more than one graph bar, so individualswith Post-doctorate degrees are also found in thePhD degrees bar.

Many different types of visualizations can beeasily achieved using LattesLab: Figure 7 showshow many of the students in our thematic data setobtained his/hers Masters’ degree per year. Ofcourse the interpretation of the results of thesegraphics is heavily dependent on the environmentwhere the data has been collected: it could bepossible to infer from Figure 7 that the numberof degrees awarded is increasing over the years,or that there are in general more degrees awardedin even years, but the data itself does not answerwhy this is happening. It must be pointed outthat we’re only showing some examples of analy-sis/visualizations as examples of Exploratory DataAnalysis that can be achieved with LattesLab.

Data frames generated with the LattesLab li-

Figure 6: Academic Level of the Researchers.

Figure 7: Master Degrees obtained per year.

brary also contain statistics on publications as de-clared in the Lattes CVs. Publications counts arestored by type and year of publication. One in-dicator of interest would be the evolution of thenumber of papers of a given researcher, or the en-tire researcher group, over the years.

We could consider that there are different pro-files for Scientific Initiation grantees, dependingon whether the grantee was able to publish his/herwork some time after receiving the grant, and de-pending on the number of papers published peryear.

In order to perform this type of analysis, a fea-ture vector – that counted the number of paperspublished in conferences and journals each yearafter the student received his or hers first grant –

174

Page 8: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

was used. This is one of the many possible ways toanalyze individual and group publication indexes,and can be easily extracted from the data frameobtained from the LattesLab library.

We know that some Scientific Initiationgrantees did not pursued further academic activ-ities – it is to be expected that these students didnot publish their results, or, if they did, did that oneor two years after being awarded the grant. This isone profile we expect from our data and featurevector – are there others? What are the most inter-esting or unexpected profiles?

We used the Fuzzy C-Means clustering algo-rithm (Bezdek et al., 1984) to group the 876 pro-files extracted from the grantees CVs into nine dif-ferent groups (there are metrics that can be used toindicate the best number of groups for clustering adata set, but these metrics are sometimes conflict-ing and inconclusive (Morais et al., 2015)).

The centroids obtained from the profiles areshown in Figure 8. That figure shows some in-teresting patterns – one, already expected, showsthat the grantees did not published any paper be-tween being awarded the grant and 12 years afterthe award. Other patterns are also interesting: theclustering algorithm identified six patterns wherethe grantee did not publish on the first year (alsoexpected), published one or more papers on thesecond year and fewer and fewer on subsequentyears.

Figure 8: Model with all data classified in ninecentroids.

Some of the patterns shown in Figure 8 indi-cates that the grantees kept publishing for yearsafter being awarded the grant.

It is possible to see a large cluster correspond-ing to grantees that had no publications over the

years consider in this example – the one patternwe know that that was in our data. To keep withthe Exploratory Data Analysis approach we elim-inated from the data frame (again, a simple op-eration since the Lattes data was converted to adata frame) those grantees which have never pub-lished, and ran the Fuzzy C-Means algorithm witheight clusters. Resulting centroids are shown inFigure 9.

Figure 9: Model with all non-zero data classifiedin eight centroids.

As expected, profiles (centroids) in Figures 8and 9 are very similar, indicating that the removalof the profile with zero publications did not changethe clustering results much. Further analysis couldbe performed to better characterize the remain-ing profiles, or to investigate different profiles thatmay appear when we consider only publicationsafter four years of being awarded the grant.

5 Conclusion and Future WorkAs shown in the previous section, the LattesLablibrary has the capability to generate a data framecontaining most of the metadata and quantitativedata (counts for categories and years) of a local,thematic collection of Lattes CVs. Different typesof analysis can be easily done when combining thelibrary with other Python libraries in a standaloneapplication or Jupyter notebook. Other interestingreports and analysis that can be done with this kindof data are:

• List CVs from the local thematic collectionthat must be downloaded again (based on theage of the CV).

• Cluster CVs by different criteria usingdifferent methods more suited to Exploratory

175

Page 9: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

Data Analysis, such as the KohonenSelf-Organizing Networks for visualiza-tion (Morais et al., 2015).

• Generate a histogram of all the publications,per category and year, of the researchers on aspecific group.

• Based on the previous task, create simula-tions that include or exclude certain membersof the groups, to evaluate what if scenarios ofresearchers leaving groups or departments.

• Again based on the previous tasks, comparetwo subsets of Lattes CVs by yearly produc-tion, averaged by the number of researchersin each group, for evaluation of publicationsbetween departments or universities.

It must be pointed out that these are actual re-quests from some coordinators of graduate pro-grams and head of departments that are acting asbeta testers/evaluators of the tool.

Lattes CVs can be considered social networksin the sense that co-publications, co-orientationsand participation in events as organizers or incommittees can be extracted from the data, sincecoauthors are listed and sometimes identified bya unique ID used in the Lattes database. Futureversions of the library will have the capabilityof extracting a co-occurrence matrix of IDs thatidentify categories and times of collaborations be-tween members of a group. This could be usedto explore the social network aspect of the LattesCVs (to find cliques, temporal changes betweengroups, etc.)

Another improvement being considered is thecreation of another type of data frame that repre-sents the textual information associated with eachresearcher publication – papers titles, names ofconferences, etc. This could be used to identifyareas of interest and keywords through text min-ing techinques, making it possible to explore otherways to consider similarity between researchers.

ReferencesAlexandre D Alves, Horacio H Yanasse, and Nei Y

Soma. 2011a. Lattesminer: a multilingual dsl forinformation extraction from lattes platform. InProceedings of the compilation of the co-locatedworkshops on DSM’11, TMC’11, AGERE! 2011,AOOPES’11, NEAT’11, & VMIL’11. ACM, pages85–92.

Alexandre Donizeti Alves, Horacio Hideki Yanasse,and Nei Yoshihiro Soma. 2011b. Sucupira: a systemfor information extraction of the lattes platform toidentify academic social networks. In InformationSystems and Technologies (CISTI), 2011 6th IberianConference on. IEEE, pages 1–6.

Wonder AL Alves, Saulo D Santos, and Pedro HTSchimit. 2016. Hierarchical clustering based on re-ports generated by scriptlattes. In IFIP InternationalConference on Advances in Production ManagementSystems. Springer, pages 28–35.

Eduardo B Araujo, Andre A Moreira, Vasco Furtado,Tarcisio HC Pequeno, and Jose S Andrade Jr. 2014.Collaboration networks from a large cv database:dynamics, topology and bonus impact. PloS one9(3):e90537.

James C Bezdek, Robert Ehrlich, and William Full.1984. Fcm: The fuzzy c-means clustering algo-rithm. Computers & Geosciences 10(2-3):191–203.

L Digiampietri, J Mena-Chalco, J de Jesus Perez-Alcazar, Esteban F Tuesta, K Delgado, and RogerioMugnaini. 2012. Minerando e caracterizando dadosde currıculos lattes. In Brazilian Workshop on So-cial Network Analysis and Mining (BraSNAM).

Gustavo de O Fernandes, Jonice de O Sampaio, andJM Souza. 2011. Xmlattes a tool for importing andexporting curricula data. In International Confer-ence on Information and Knowledge Engineering.

Ross Ihaka and Robert Gentleman. 1996. R: a lan-guage for data analysis and graphics. Journal ofcomputational and graphical statistics 5(3):299–314.

Anthony Kay. 2007. Tesseract: an open-source op-tical character recognition engine. Linux Journal2007(159):2.

Thomas Kluyver, Benjamin Ragan-Kelley, Fer-nando Perez, Brian Granger, Matthias Bussonnier,Jonathan Frederic, Kyle Kelley, Jessica Hamrick,Jason Grout, Sylvain Corlay, et al. 2016. Jupyternotebooks—a publishing format for reproduciblecomputational workflows. Positioning and Power inAcademic Publishing: Players, Agents and Agendaspage 87.

Jesus Pascual Mena-Chalco, Luciano AntonioDigiampietri, Fabrıcio Martins Lopes, andRoberto Marcondes Cesar. 2014. Brazilianbibliometric coauthorship networks. Journal of theAssociation for Information Science and Technology65(7):1424–1445.

Jesus Pascual Mena-Chalco, Cesar Junior, and RobertoMarcondes. 2009. Scriptlattes: an open-sourceknowledge extraction system from the lattes plat-form. Journal of the Brazilian Computer Society15(4):31–39.

176

Page 10: Data Science Approach to Analysis of Lattes CV Dataceur-ws.org/Vol-2029/paper16.pdf · Lattes CVs considered on the analysis, and which may be used as input to other analysis tasks.

Alessandra Marli M Morais, Rafael DC Santos, andM Jordan Raddick. 2015. Visualization of citizenscience volunteers’ behaviors with data from us-age logs. Computing in Science & Engineering17(4):42–50.

R. D. Peng. 2011. Reproducible research in com-putational science. Science 334(6060):1226–1227.https://doi.org/10.1126/science.1213847.

Evelyn Perez-Cervantes, Jesus P Mena-Chalco, andRoberto M Cesar. 2012. Towards a quantitativeacademic internationalization assessment of brazil-ian research groups. In E-Science (e-Science), 2012IEEE 8th International Conference on. IEEE, pages1–8.

Rachel Schutt and Cathy O’Neil. 2013. Doing datascience: Straight talk from the frontline. O’ReillyMedia, Inc.

Guido VanRossum and Fred L Drake. 2010. Thepython language reference. Python Software Foun-dation Amsterdam, Netherlands.

177