The Lattes platform has hosted the curricula of Brazilian researchers since the late 1990s and now contains more than 5 million of them. Lattes curricula can be downloaded in XML format, but reading these files is complex. That complexity motivated the development of the getLattes package, which imports the information from the XML files into a list in R and then tabulates the Lattes data into a data.frame.

The main pieces of information contained in the XML files, and the getLattes functions that import them, are:

  • Academic Paper Presentations getApresentacaoTrabalho
  • Research Area getAreasAtuacao
  • Published Papers getArtigosPublicados
  • Professional Links getAtuacoesProfissionais
  • Ph.D. Examination Boards getBancasDoutorado
  • Undergraduate Examination Boards getBancasGraduacao
  • General Examination Boards getBancasJulgadoras
  • Master Examination Boards getBancasMestrado
  • Book Chapters getCapitulosLivros
  • Short Duration Course getCursoCurtaDuracao
  • General Data getDadosGerais
  • Professional Address getEnderecoProfissional
  • Events and Congresses getEventosCongressos
  • Professional Formation getFormacao
  • Languages getIdiomas
  • Newspapers and Magazines getJornaisRevistas
  • Research Lines getLinhaPesquisa
  • Published Books getLivrosPublicados
  • Event Organization getOrganizacaoEvento
  • Academic Advisory (Ph.D. Thesis) getOrientacoesDoutorado
  • Academic Advisory (Master Thesis) getOrientacoesMestrado
  • Academic Advisory (Other) getOrientacoesOutras
  • Academic Advisory (Post Doctorate) getOrientacoesPosDoutorado
  • Other Bibliographic Productions getOutrasProducoesBibliograficas
  • Other Technical Productions getOutrasProducoesTecnicas
  • Participation in Projects getParticipacaoProjeto
  • Preface getPrefacio
  • Awards and Medals getPremiosTitulos
  • Technical Production getProducaoTecnica
  • TV and Radio Program getProgramaRadioTV
  • Research Report getRelatorioPesquisa
  • Works in Event getTrabalhosEventos

With the functionality presented in this package, the main challenge in working with Lattes curriculum data is now downloading it, since the site is protected by Captchas. To download many curricula I suggest Captchas Negated by Python reQuests - CNPQ. The second barrier is managing and processing a large volume of data: the whole Lattes platform in XML files totals over 200 GB. This tutorial focuses on the features of the getLattes package; downloading and managing the files is left to the reader.

Follow an example of how to search and download data from the Lattes website.

Installation

Install the released version of getLattes from GitHub.

# install and load devtools from CRAN
install.packages("devtools")
library(devtools)

# install and load getLattes
devtools::install_github("roneyfraga/getLattes")

Load getLattes.

library(getLattes)

# support packages
library(dplyr)
library(tibble)
library(pipeR)

Extract XML from zip

The id variable is the key for merging the tables produced by the get functions. The id has 16 digits and is unique to each curriculum. It is therefore important to rename the .xml file from curriculo.xml to [16 digits id].xml: because Lattes has many versions of its XML structure, the most consistent way to obtain the id is from the file name.

# the zip file(s) are stored in datatest/
unzipLattes(filezip='2854855744345507.zip', path='datatest/')
unzipLattes(filezip='*.zip', path='datatest/')

# the zip files are stored in the working directory
unzipLattes(filezip='*.zip')
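
Since the id is taken from the file name, it can help to check that every file follows the 16-digit convention before importing. The sketch below uses only base R; the helper name is_lattes_xml_name and the example paths are made up for illustration and are not part of getLattes.

```r
# Hypothetical helper: TRUE for names like "2854855744345507.xml"
is_lattes_xml_name <- function(files) {
  grepl("^[0-9]{16}\\.xml$", basename(files))
}

# Example paths (illustrative only)
files <- c("datatest/2854855744345507.xml", "datatest/curriculo.xml")
is_lattes_xml_name(files)
#> [1]  TRUE FALSE

# Recover the id from the well-formed file names
ids <- sub("\\.xml$", "", basename(files[is_lattes_xml_name(files)]))
ids
#> [1] "2854855744345507"
```

A file still named curriculo.xml fails the check, which signals that it has not been renamed yet.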

Import XML curriculum data

# the file 4984859173592703.xml 
cl <- readLattes(filexml='4984859173592703.xml', path='datatest/')

# import several files
cls <- readLattes(filexml='*.xml$', path='datatest/')

# import xml files from working directory
cls <- readLattes(filexml='*.xml$')

As an example, the xmlsLattes dataset contains 500 random curricula imported as an R list.

data(xmlsLattes)
length(xmlsLattes)
#> [1] 500
names(xmlsLattes[[1]])
#> [1] "DADOS-GERAIS"         "DADOS-COMPLEMENTARES" ".attrs"              
#> [4] "id"

get functions

To read data from a single curriculum, any get function can be called on its own. To import data from two or more curricula, it is easier to use the get functions with lapply.

# to import from one curriculum 
getDadosGerais(xmlsLattes[[499]])
#>   nome.completo nome.em.citacoes.bibliograficas nacionalidade
#> 1    Rui Pretto                      PRETTO, R.             B
#>   pais.de.nascimento uf.nascimento cidade.nascimento permissao.de.divulgacao
#> 1             Brasil            RS       SANTA MARIA                     NAO
#>   outras.informacoes.relevantes               id data.atualizacao
#> 1                               0001028411019168         07122004

Import general data from the 500 curricula. The output is a list of data frames, which bind_rows converts into a single data frame.

lt <- lapply(xmlsLattes, getDadosGerais)
lt <- bind_rows(lt)
glimpse(lt)
#> Rows: 499
#> Columns: 14
#> $ nome.completo                   <chr> "Fernando Migliorini Tenório", "Maria…
#> $ nome.em.citacoes.bibliograficas <chr> "TENÓRIO, F. M.", "SILVA, M. C. J.", …
#> $ nacionalidade                   <chr> "B", "B", "B", "B", "B", "B", "B", "B…
#> $ pais.de.nascimento              <chr> "Brasil", "Brasil", "Brasil", "Brasil…
#> $ uf.nascimento                   <chr> "PR", "BA", "", "BA", "AL", "AL", "MG…
#> $ cidade.nascimento               <chr> "ARAPONGAS", "EUNAPOLIS", "", "Salvad…
#> $ permissao.de.divulgacao         <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "N…
#> $ id                              <chr> "0000000403000690", "0000002580990139…
#> $ data.atualizacao                <chr> "29022008", "11102009", "28012015", "…
#> $ data.falecimento                <chr> NA, "", "", "", "", NA, "", "", "", "…
#> $ sigla.pais.nacionalidade        <chr> NA, NA, "BRA", NA, "BRA", NA, "BRA", …
#> $ pais.de.nacionalidade           <chr> NA, NA, "Brasil", NA, "Brasil", NA, "…
#> $ outras.informacoes.relevantes   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ texto.resumo.cv.rh              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
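
Many of the NA entries above presumably come from fields that exist in some curricula but not in others; bind_rows fills such missing columns with NA. The base-R sketch below mimics that behaviour on two made-up records. The helper bind_fill is hypothetical and is not how dplyr implements bind_rows.

```r
# Hypothetical helper: stack data frames, filling absent columns with NA
bind_fill <- function(dfs) {
  cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (m in setdiff(cols, names(d))) {
      d[[m]] <- rep(NA_character_, nrow(d))
    }
    d[cols]  # same column order in every data frame
  })
  do.call(rbind, filled)
}

# Two toy curricula; the second one has an extra field
a <- data.frame(id = "0000000000000001", nome.completo = "Researcher A",
                stringsAsFactors = FALSE)
b <- data.frame(id = "0000000000000002", nome.completo = "Researcher B",
                pais.de.nacionalidade = "Brasil", stringsAsFactors = FALSE)

res_fill <- bind_fill(list(a, b))
res_fill$pais.de.nacionalidade
#> [1] NA       "Brasil"
```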

To write more compactly, I will use the %>>% pipe from the pipeR package.

lapply(xmlsLattes, getDadosGerais) %>>%
    bind_rows %>>%
    glimpse
#> Rows: 499
#> Columns: 14
#> $ nome.completo                   <chr> "Fernando Migliorini Tenório", "Maria…
#> $ nome.em.citacoes.bibliograficas <chr> "TENÓRIO, F. M.", "SILVA, M. C. J.", …
#> $ nacionalidade                   <chr> "B", "B", "B", "B", "B", "B", "B", "B…
#> $ pais.de.nascimento              <chr> "Brasil", "Brasil", "Brasil", "Brasil…
#> $ uf.nascimento                   <chr> "PR", "BA", "", "BA", "AL", "AL", "MG…
#> $ cidade.nascimento               <chr> "ARAPONGAS", "EUNAPOLIS", "", "Salvad…
#> $ permissao.de.divulgacao         <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "N…
#> $ id                              <chr> "0000000403000690", "0000002580990139…
#> $ data.atualizacao                <chr> "29022008", "11102009", "28012015", "…
#> $ data.falecimento                <chr> NA, "", "", "", "", NA, "", "", "", "…
#> $ sigla.pais.nacionalidade        <chr> NA, NA, "BRA", NA, "BRA", NA, "BRA", …
#> $ pais.de.nacionalidade           <chr> NA, NA, "Brasil", NA, "Brasil", NA, "…
#> $ outras.informacoes.relevantes   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ texto.resumo.cv.rh              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Here (. -> res) means that the result is saved to the object res.

lapply(xmlsLattes, getDadosGerais) %>>%
    bind_rows %>>%
    (. -> res)

glimpse(res)
#> Rows: 499
#> Columns: 14
#> $ nome.completo                   <chr> "Fernando Migliorini Tenório", "Maria…
#> $ nome.em.citacoes.bibliograficas <chr> "TENÓRIO, F. M.", "SILVA, M. C. J.", …
#> $ nacionalidade                   <chr> "B", "B", "B", "B", "B", "B", "B", "B…
#> $ pais.de.nascimento              <chr> "Brasil", "Brasil", "Brasil", "Brasil…
#> $ uf.nascimento                   <chr> "PR", "BA", "", "BA", "AL", "AL", "MG…
#> $ cidade.nascimento               <chr> "ARAPONGAS", "EUNAPOLIS", "", "Salvad…
#> $ permissao.de.divulgacao         <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "N…
#> $ id                              <chr> "0000000403000690", "0000002580990139…
#> $ data.atualizacao                <chr> "29022008", "11102009", "28012015", "…
#> $ data.falecimento                <chr> NA, "", "", "", "", NA, "", "", "", "…
#> $ sigla.pais.nacionalidade        <chr> NA, NA, "BRA", NA, "BRA", NA, "BRA", …
#> $ pais.de.nacionalidade           <chr> NA, NA, "Brasil", NA, "Brasil", NA, "…
#> $ outras.informacoes.relevantes   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ texto.resumo.cv.rh              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Note that all variable names returned by the get functions are transcriptions of the field names in the XML file, with - replaced by . and capital letters replaced by lower case letters.
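
The naming rule can be reproduced in base R. The XML field names below are assumed to be the upper-case, hyphenated counterparts of the columns shown earlier; this is a sketch of the rule, not the code getLattes runs.

```r
# Assumed XML field names (upper case, hyphen-separated)
xml_fields <- c("NOME-COMPLETO", "PAIS-DE-NASCIMENTO", "DATA-ATUALIZACAO")

# Replace "-" with "." and convert to lower case
col_names <- tolower(gsub("-", ".", xml_fields, fixed = TRUE))
col_names
#> [1] "nome.completo"      "pais.de.nascimento" "data.atualizacao"
```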

Advisory

Ph.D. Advisory

lapply(xmlsLattes, getOrientacoesDoutorado) %>>%
    bind_rows %>>%
    glimpse()
#> Rows: 55
#> Columns: 24
#> $ natureza                    <chr> "Tese de doutorado", "Tese de doutorado",…
#> $ titulo                      <chr> "Condicionantes geológicos, geomorfológic…
#> $ ano                         <chr> "2000", "2000", "2005", "2005", "2004", "…
#> $ pais                        <chr> "Brasil", "Brasil", "Brasil", "Brasil", "…
#> $ idioma                      <chr> "Português", "Português", "Português", "P…
#> $ home.page                   <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ flag.relevancia             <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "NAO",…
#> $ doi                         <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ titulo.ingles               <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ tipo.de.orientacao          <chr> "CO_ORIENTADOR", "ORIENTADOR_PRINCIPAL", …
#> $ nome.do.orientado           <chr> "Luiz Almeida Prado Bacelar", "Reiner Oli…
#> $ codigo.instituicao          <chr> "020200000009", "020200000009", "02020000…
#> $ nome.da.instituicao         <chr> "Universidade Federal do Rio de Janeiro",…
#> $ codigo.curso                <chr> "31000282", "31000240", "31000240", "3100…
#> $ nome.do.curso               <chr> "Engenharia Civil", "Geografia", "Geograf…
#> $ flag.bolsa                  <chr> "SIM", "NAO", "SIM", "NAO", "SIM", "NAO",…
#> $ codigo.agencia.financiadora <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ nome.da.agencia             <chr> "Fundação de Amparo à Pesquisa do Estado …
#> $ numero.de.paginas           <chr> "226", "163", "0", "398", "217", "150", "…
#> $ numero.id.orientado         <chr> "", "4484335621458630", "", "066901315042…
#> $ nome.do.curso.ingles        <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ id                          <chr> "0000325690951570", "0000325690951570", "…
#> $ codigo.orgao                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ nome.orgao                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Master Advisory

lapply(xmlsLattes, getOrientacoesMestrado) %>>%
    bind_rows %>>%
    glimpse()
#> Rows: 149
#> Columns: 25
#> $ natureza                    <chr> "Dissertação de mestrado", "Dissertação d…
#> $ tipo                        <chr> "ACADEMICO", "ACADEMICO", "ACADEMICO", "A…
#> $ titulo                      <chr> "O Papel da Formiga Sauva (Do Genero Atta…
#> $ ano                         <chr> "1991", "1995", "1997", "1998", "1998", "…
#> $ pais                        <chr> "Brasil", "Brasil", "Brasil", "Brasil", "…
#> $ idioma                      <chr> "Português", "Português", "Português", "P…
#> $ home.page                   <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ flag.relevancia             <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "NAO",…
#> $ doi                         <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ titulo.ingles               <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ tipo.de.orientacao          <chr> "ORIENTADOR_PRINCIPAL", "ORIENTADOR_PRINC…
#> $ nome.do.orientado           <chr> "Carlos Edgar de Deus", "Marcelo Eduardo …
#> $ codigo.instituicao          <chr> "020200000009", "020200000009", "02020000…
#> $ nome.da.instituicao         <chr> "Universidade Federal do Rio de Janeiro",…
#> $ codigo.curso                <chr> "31000240", "31000240", "31000240", "3100…
#> $ nome.do.curso               <chr> "Geografia", "Geografia", "Geografia", "G…
#> $ flag.bolsa                  <chr> "NAO", "SIM", "SIM", "SIM", "SIM", "SIM",…
#> $ codigo.agencia.financiadora <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ nome.da.agencia             <chr> "", "Coordenação de Aperfeiçoamento de Pe…
#> $ numero.de.paginas           <chr> "", "142", "153", "104", "136", "118", "1…
#> $ numero.id.orientado         <chr> "", "6211918426877457", "", "689326802882…
#> $ nome.do.curso.ingles        <chr> "", "", "", "", "", "", "", "", "", "", "…
#> $ id                          <chr> "0000325690951570", "0000325690951570", "…
#> $ codigo.orgao                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ nome.orgao                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Post Doctorate

lapply(xmlsLattes, getOrientacoesPosDoutorado) %>>%
    bind_rows %>>%
    glimpse()
#> Rows: 6
#> Columns: 22
#> $ natureza                    <chr> "Supervisão de pós-doutorado", "Supervisã…
#> $ titulo                      <chr> "", "", "", "", "", ""
#> $ ano                         <chr> "2013", "2013", "2011", "2013", "2011", "…
#> $ pais                        <chr> "Brasil", "Brasil", "Brasil", "Brasil", "…
#> $ idioma                      <chr> "Português", "Português", "Português", "P…
#> $ home.page                   <chr> "", "", "", "", "", ""
#> $ flag.relevancia             <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "NAO"
#> $ doi                         <chr> "", "", "", "", "", ""
#> $ titulo.ingles               <chr> "", "", "", "", "", ""
#> $ tipo.de.orientacao          <chr> "", "", "", "", "", ""
#> $ nome.do.orientado           <chr> "Bruno Henriques Coutinho", "Anderson Mul…
#> $ codigo.instituicao          <chr> "020200000009", "020200000009", "06140000…
#> $ nome.da.instituicao         <chr> "Universidade Federal do Rio de Janeiro",…
#> $ codigo.curso                <chr> "", "", "", "", "", ""
#> $ nome.do.curso               <chr> "", "", "", "", "", ""
#> $ flag.bolsa                  <chr> "SIM", "NAO", "SIM", "SIM", "SIM", "SIM"
#> $ codigo.agencia.financiadora <chr> "", "", "", "", "", ""
#> $ nome.da.agencia             <chr> "Conselho Nacional de Desenvolvimento Cie…
#> $ numero.de.paginas           <chr> "", "", "", "", "", ""
#> $ numero.id.orientado         <chr> "", "", "", "", "", ""
#> $ nome.do.curso.ingles        <chr> "", "", "", "", "", ""
#> $ id                          <chr> "0000325690951570", "0000325690951570", "…

Other

lapply(xmlsLattes, getOrientacoesOutras) %>>%
    bind_rows %>>%
    glimpse()
#> Rows: 1,054
#> Columns: 24
#> $ natureza                     <chr> "MONOGRAFIA_DE_CONCLUSAO_DE_CURSO_APERFE…
#> $ tipo                         <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ titulo                       <chr> "Estudo dos Aspectos da Aerofauna e Inve…
#> $ ano                          <chr> "2005", "2005", "2002", "2002", "2002", …
#> $ pais                         <chr> "Brasil", "Brasil", "Brasil", "Brasil", …
#> $ idioma                       <chr> "Português", "Português", "Português", "…
#> $ home.page                    <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ flag.relevancia              <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "NAO"…
#> $ doi                          <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ titulo.ingles                <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ tipo.ingles                  <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ nome.do.orientado            <chr> "Rodrigo Ribeiro Freitas", "Greicy Zacca…
#> $ codigo.instituicao           <chr> "125200000006", "125200000006", "1252000…
#> $ nome.da.instituicao          <chr> "Universidade do Extremo Sul Catarinense…
#> $ codigo.curso                 <chr> "90000043", "90000043", "60138670", "900…
#> $ nome.do.curso                <chr> "Gestão Ambiental", "Gestão Ambiental", …
#> $ flag.bolsa                   <chr> "NAO", "NAO", "NAO", "NAO", "NAO", "NAO"…
#> $ codigo.agencia.financiadora  <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ nome.da.agencia              <chr> "", "", "", "", "", "", "", "", "", "Con…
#> $ tipo.de.orientacao.concluida <chr> "", "", "", "ORIENTADOR_PRINCIPAL", "ORI…
#> $ numero.de.paginas            <chr> "", "", "", "25", "32", "18", "28", "31"…
#> $ numero.id.orientado          <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ nome.do.curso.ingles         <chr> "", "", "", "", "", "", "", "", "", "", …
#> $ id                           <chr> "0000021871668960", "0000021871668960", …

Published Academic Papers

lapply(xmlsLattes, getArtigosPublicados) %>>%
    bind_rows %>>%
    as_tibble %>>%
    (. -> pub)

head(pub)
#> # A tibble: 6 x 23
#>   natureza titulo.do.artigo ano.do.artigo pais.de.publica… idioma
#>   <chr>    <chr>            <chr>         <chr>            <chr> 
#> 1 COMPLETO Modeling the sp… 2013          ""               Inglês
#> 2 COMPLETO Photorefractive… 2011          ""               Portu…
#> 3 COMPLETO Pump-induced re… 2015          ""               Inglês
#> 4 COMPLETO Frozen waves: e… 2012          ""               Inglês
#> 5 COMPLETO Desenvolvimento… 2013          ""               Portu…
#> 6 COMPLETO Avaliação do Si… 2014          ""               Portu…
#> # … with 18 more variables: meio.de.divulgacao <chr>,
#> #   home.page.do.trabalho <chr>, flag.relevancia <chr>, doi <chr>,
#> #   titulo.do.artigo.ingles <chr>, flag.divulgacao.cientifica <chr>,
#> #   titulo.do.periodico.ou.revista <chr>, issn <chr>, volume <chr>,
#> #   fasciculo <chr>, serie <chr>, pagina.inicial <chr>, pagina.final <chr>,
#> #   local.de.publicacao <chr>, autores <chr>, autores.citacoes <chr>,
#> #   autores.id <chr>, id <chr>

Normalize functions

The information in Lattes curricula is not standardized: each user types it into predefined fields, and errors creep in when the data is entered. Some examples:

  • two co-authors of an article each maintain their own Lattes curriculum, and each may enter a different ISSN for the same journal, which can change the journal title, one listed as (Print) and the other as (Online).
  • the two authors may enter different years for the same article.
  • the same two authors may mistype the journal title.

The functions normalizeByDoi, normalizeJournals, and normalizeYears correct most problems related to data entry errors.

normalizeByDoi groups all articles that share the same DOI and, within each group, replaces the title, year, and journal name with the most frequent value.

# use explicit arguments
normalizeByDoi(pub, doi='doi', year='ano.do.artigo', issn='issn', paperTitle='titulo.do.artigo', journalName='titulo.do.periodico.ou.revista')
#> # A tibble: 847 x 28
#>    natureza titulo.do.artig… ano.do.artigo_o… pais.de.publica… idioma
#>    <chr>    <chr>            <chr>            <chr>            <chr> 
#>  1 COMPLETO Modeling the sp… 2013             ""               Inglês
#>  2 COMPLETO Photorefractive… 2011             ""               Portu…
#>  3 COMPLETO Pump-induced re… 2015             ""               Inglês
#>  4 COMPLETO Frozen waves: e… 2012             ""               Inglês
#>  5 COMPLETO Desenvolvimento… 2013             ""               Portu…
#>  6 COMPLETO Avaliação do Si… 2014             ""               Portu…
#>  7 COMPLETO Pinhão manso (J… 2013             ""               Portu…
#>  8 COMPLETO Município de Cé… 2013             ""               Portu…
#>  9 COMPLETO Calculo do Indi… 2013             ""               Portu…
#> 10 COMPLETO Milho (Zea Mays… 2012             ""               Portu…
#> # … with 837 more rows, and 23 more variables: meio.de.divulgacao <chr>,
#> #   home.page.do.trabalho <chr>, flag.relevancia <chr>, doi.x <chr>,
#> #   titulo.do.artigo.ingles <chr>, flag.divulgacao.cientifica <chr>,
#> #   titulo.do.periodico.ou.revista_old <chr>, issn_old <chr>, volume <chr>,
#> #   fasciculo <chr>, serie <chr>, pagina.inicial <chr>, pagina.final <chr>,
#> #   local.de.publicacao <chr>, autores <chr>, autores.citacoes <chr>,
#> #   autores.id <chr>, id <chr>, doi.y <chr>,
#> #   titulo.do.periodico.ou.revista <chr>, issn <chr>, ano.do.artigo <chr>,
#> #   titulo.do.artigo <chr>

# use the default column names from getArtigosPublicados
normalizeByDoi(pub)
#> # A tibble: 847 x 28
#>    natureza titulo.do.artig… ano.do.artigo_o… pais.de.publica… idioma
#>    <chr>    <chr>            <chr>            <chr>            <chr> 
#>  1 COMPLETO Modeling the sp… 2013             ""               Inglês
#>  2 COMPLETO Photorefractive… 2011             ""               Portu…
#>  3 COMPLETO Pump-induced re… 2015             ""               Inglês
#>  4 COMPLETO Frozen waves: e… 2012             ""               Inglês
#>  5 COMPLETO Desenvolvimento… 2013             ""               Portu…
#>  6 COMPLETO Avaliação do Si… 2014             ""               Portu…
#>  7 COMPLETO Pinhão manso (J… 2013             ""               Portu…
#>  8 COMPLETO Município de Cé… 2013             ""               Portu…
#>  9 COMPLETO Calculo do Indi… 2013             ""               Portu…
#> 10 COMPLETO Milho (Zea Mays… 2012             ""               Portu…
#> # … with 837 more rows, and 23 more variables: meio.de.divulgacao <chr>,
#> #   home.page.do.trabalho <chr>, flag.relevancia <chr>, doi.x <chr>,
#> #   titulo.do.artigo.ingles <chr>, flag.divulgacao.cientifica <chr>,
#> #   titulo.do.periodico.ou.revista_old <chr>, issn_old <chr>, volume <chr>,
#> #   fasciculo <chr>, serie <chr>, pagina.inicial <chr>, pagina.final <chr>,
#> #   local.de.publicacao <chr>, autores <chr>, autores.citacoes <chr>,
#> #   autores.id <chr>, id <chr>, doi.y <chr>,
#> #   titulo.do.periodico.ou.revista <chr>, issn <chr>, ano.do.artigo <chr>,
#> #   titulo.do.artigo <chr>
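
The "most frequent value per DOI" idea can be illustrated in base R. This is only a sketch of the principle on invented records (the DOIs and years are made up); it is not the implementation used by normalizeByDoi.

```r
# Toy records: three curricula list the same article (same DOI) with
# conflicting years; all values are invented for illustration
pubs <- data.frame(
  doi = c("10.1000/a", "10.1000/a", "10.1000/a", "10.1000/b"),
  ano = c("2013", "2013", "2012", "2015"),
  stringsAsFactors = FALSE
)

# Most frequent value of a character vector
# (ties go to the first value in sorted order)
most_frequent <- function(x) names(which.max(table(x)))

# Replace each year by the most frequent year within its DOI group
pubs$ano_norm <- ave(pubs$ano, pubs$doi, FUN = most_frequent)
pubs$ano_norm
#> [1] "2013" "2013" "2013" "2015"
```

The third record, entered as 2012, is aligned with the other two records that share its DOI.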

Not every article has a DOI, but we can still normalize the journal name and ISSN with normalizeJournals. The result has two new columns, issn_old and titulo.do.periodico.ou.revista_old, which allow us to inspect the substitutions. The more curricula you analyze, the more useful the normalize functions are.

# use explicit arguments
nj <- normalizeJournals(pub, issn='issn', journalName='titulo.do.periodico.ou.revista')

# use the default column names from getArtigosPublicados
nj <- normalizeJournals(pub)

nj %>>%
    select(issn_old, issn, titulo.do.periodico.ou.revista_old, titulo.do.periodico.ou.revista) %>>%
    tail
#> # A tibble: 6 x 4
#>   issn_old issn     titulo.do.periodico.ou.revista… titulo.do.periodico.ou.revi…
#>   <chr>    <chr>    <chr>                           <chr>                       
#> 1 23180587 23180587 Revista de Pesquisa em Saúde    Revista de Pesquisa em Saúde
#> 2 23180587 23180587 Revista de Pesquisa em Saúde    Revista de Pesquisa em Saúde
#> 3 23180587 23180587 Revista de Pesquisa em Saúde    Revista de Pesquisa em Saúde
#> 4 23401079 23401079 INTED 2015 PROCEEDINGS          INTED 2015 PROCEEDINGS      
#> 5 23594330 23594330 Rev. de Atenção à Saúde         Rev. de Atenção à Saúde     
#> 6 511 8    511 8    Revista Brasileira de Computaç… Revista Brasileira de Compu…
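
The last row above shows an ISSN ("511 8") that does not look well formed. A simple base-R check can flag such entries before or after normalization. The helper looks_like_issn is hypothetical and not part of getLattes; it assumes ISSNs are stored without the hyphen, as in the output above.

```r
# Hypothetical check: an ISSN without the hyphen is seven digits
# followed by a digit or "X"
looks_like_issn <- function(x) grepl("^[0-9]{7}[0-9Xx]$", x)

issn <- c("23180587", "23401079", "511 8", "")
looks_like_issn(issn)
#> [1]  TRUE  TRUE FALSE FALSE
```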

Finally, if two papers have the same title and were published in the same journal, the year can be normalized with:

# use explicit arguments
ny <- normalizeYears(pub, year2normalize='ano.do.artigo', issn='issn', journalName='titulo.do.periodico.ou.revista', paperTitle='titulo.do.artigo')

# use the default column names from getArtigosPublicados
ny <- normalizeYears(pub)

ny %>>%
    select(ano_old, ano, issn, titulo.do.periodico.ou.revista, titulo.do.artigo) %>>%
    head
#> # A tibble: 6 x 5
#>   ano_old ano   issn    titulo.do.periodico.ou.… titulo.do.artigo               
#>   <chr>   <chr> <chr>   <chr>                    <chr>                          
#> 1 1974    1974  "00066… Boletim Paulista de Geo… Formacao Macacu: variações tex…
#> 2 1979    1979  ""      Revista Brasileira de H… Análise de Frequência de Chuva…
#> 3 1980    1980  "00347… Revista Brasileira de G… Os Solos e Hidrologia das Enco…
#> 4 1981    1981  ""      REVISTA BRASILEIRA DE H… Ritmos e Variabialidade das Pr…
#> 5 1982    1982  "00218… Journal of Algebra (Pri… On weak commutativity between …
#> 6 1983    1983  "02636… Journal of Hypertension  Plasma endogenous sodium pump …

To type less, we can do:

lapply(xmlsLattes, getArtigosPublicados) %>>%
    bind_rows %>>%
    as_tibble %>>%
    normalizeByDoi %>>%
    normalizeJournals %>>%
    normalizeYears %>>%
    select(titulo.do.artigo,ano.do.artigo,issn,titulo.do.periodico.ou.revista,id) %>>%
    slice(1:10)
#> # A tibble: 10 x 5
#>    titulo.do.artigo           ano.do.artigo issn   titulo.do.periodico.o… id    
#>    <chr>                      <chr>         <chr>  <chr>                  <chr> 
#>  1 Formacao Macacu: variaçõe… 1974          00066… Boletim Paulista de G… 00003…
#>  2 Análise de Frequência de … 1979          noISS… Revista Brasileira de… 00003…
#>  3 Os Solos e Hidrologia das… 1980          00347… Revista Brasileira de… 00003…
#>  4 Ritmos e Variabialidade d… 1981          noISS… REVISTA BRASILEIRA DE… 00003…
#>  5 On weak commutativity bet… 1982          00218… Journal of Algebra (P… 00005…
#>  6 Plasma endogenous sodium … 1983          02636… Journal of Hypertensi… 00001…
#>  7 Central Hypertensive Effe… 1984          00039… Archives Internationa… 00001…
#>  8 Uma Nota Sobre A Estrutur… 1985          01028… Matemática Universitá… 00005…
#>  9 Diagonal Embeddings Of Ni… 1986          00192… Illinois Journal of M… 00005…
#> 10 A Grande Alma              1986          noISS… Crônicas: Este Mundo … 00002…

Merge data

To join the data from different tables, the key is the variable id, a unique 16-digit code.

lapply(xmlsLattes, getArtigosPublicados) %>>%
    bind_rows %>>%
    as_tibble %>>%
    normalizeByDoi %>>%
    normalizeJournals %>>%
    normalizeYears %>>%
    select(titulo.do.artigo,ano.do.artigo,issn,titulo.do.periodico.ou.revista,id) %>>%
    left_join(lapply(xmlsLattes, getDadosGerais) %>>% bind_rows %>>% select(id, nome.completo, pais.de.nascimento), by = 'id') %>>%
    slice(1:10)
#> # A tibble: 10 x 7
#>    titulo.do.artigo ano.do.artigo issn  titulo.do.perio… id    nome.completo
#>    <chr>            <chr>         <chr> <chr>            <chr> <chr>        
#>  1 Formacao Macacu… 1974          0006… Boletim Paulist… 0000… Ana Luiza Co…
#>  2 Análise de Freq… 1979          noIS… Revista Brasile… 0000… Ana Luiza Co…
#>  3 Os Solos e Hidr… 1980          0034… Revista Brasile… 0000… Ana Luiza Co…
#>  4 Ritmos e Variab… 1981          noIS… REVISTA BRASILE… 0000… Ana Luiza Co…
#>  5 On weak commuta… 1982          0021… Journal of Alge… 0000… Norai Romeu …
#>  6 Plasma endogeno… 1983          0263… Journal of Hype… 0000… Luiza Cristi…
#>  7 Central Hyperte… 1984          0003… Archives Intern… 0000… Luiza Cristi…
#>  8 Uma Nota Sobre … 1985          0102… Matemática Univ… 0000… Norai Romeu …
#>  9 Diagonal Embedd… 1986          0019… Illinois Journa… 0000… Norai Romeu …
#> 10 A Grande Alma    1986          noIS… Crônicas: Este … 0000… Ruy Magalhãe…
#> # … with 1 more variable: pais.de.nascimento <chr>
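
The same kind of join can be sketched in base R with merge(), here on made-up tables. Keeping id as a character column matters because the ids start with zeros, which a numeric conversion would drop.

```r
# Toy tables keyed by the 16-digit id (all values invented)
pubs <- data.frame(
  id = c("0000000000000002", "0000000000000001", "0000000000000002"),
  titulo.do.artigo = c("Paper B1", "Paper A1", "Paper B2"),
  stringsAsFactors = FALSE
)
people <- data.frame(
  id = c("0000000000000001", "0000000000000002"),
  nome.completo = c("Researcher A", "Researcher B"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of pubs, like a left join;
# note that merge() sorts the result by the key
joined <- merge(pubs, people, by = "id", all.x = TRUE)
joined$nome.completo
#> [1] "Researcher A" "Researcher B" "Researcher B"

# The id is still a 16-character string
nchar(joined$id[1])
#> [1] 16
```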