Regular expressions

Prof. Walmes Marques Zeviani

23 Mar 2017

Objectives and rationale

Text analysis always requires some preprocessing.
Define what a regular expression is.
List material and resources for learning.
Present the resources available in R.
Introduce sed and awk.

Details

Regular expression = regex or regexp.
A concise sequence of (meta)characters that defines a pattern.
1950s: Stephen Cole Kleene (mathematician).
The concept was later adopted by the Unix text editors.
Used in:
- Search engines.
- Text editors/processors.
- Lexical analysis.
- “Search” or “search and replace”.
- Extracting key information from text documents (phone number, CEP (Brazilian postal code), CPF (Brazilian taxpayer ID), email, dates, times, monetary amounts).
Syntaxes:
- POSIX standard.
- Perl standard.

Example

Figure 1: Regular expression for matching email addresses. Source: http://learnwebtutorials.com/why-regular-expression-is-so-confusing.

Cheat sheets

Online testers

Practice

Date
CEP (Brazilian postal code)
Monetary value
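
One possible set of patterns for the three practice items, sketched with grep -E (POSIX extended syntax). The sample text and the variable names are made up for illustration. The patterns only check the shape of each item, not its validity, so a date such as 99/99/9999 would also match.

```shell
# Hypothetical sample text, one item per line.
sample='Meeting on 23/03/2017 at 2pm.
CEP 81531-980, Curitiba.
Price: R$ 1.234,56 in cash.
Nothing to match here.'

# Date as dd/mm/yyyy (shape only, no range checks).
date_re='[0-9]{2}/[0-9]{2}/[0-9]{4}'
# CEP: five digits, a hyphen, three digits.
cep_re='[0-9]{5}-[0-9]{3}'
# Monetary value: R$, optional space, dot-separated thousands, comma, cents.
money_re='R\$ ?[0-9]{1,3}(\.[0-9]{3})*,[0-9]{2}'

# -E selects extended syntax; -o prints only the matched text.
dates=$(printf '%s\n' "$sample" | grep -oE "$date_re")
ceps=$(printf '%s\n' "$sample" | grep -oE "$cep_re")
values=$(printf '%s\n' "$sample" | grep -oE "$money_re")

printf '%s\n' "$dates" "$ceps" "$values"
```

Tightening the patterns (for example, limiting the day to 01-31) is a natural next step for the exercise.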

Software

R

help(regex, help_type = "html")

grep() and grepl(): detect the pattern.
sub() and gsub(): replace the pattern.
regexpr() and gregexpr(): locate the pattern.
strsplit(): split at the pattern.

Inside R strings, backslashes must be doubled: the regex \d is written "\\d".
The default syntax is POSIX extended; set perl = TRUE to use Perl-compatible syntax.

Examples

s <- colors()

# Colors that contain "red".
grep(pattern = "red", x = s)

##  [1] 100 372 373 374 375 376 476 503 504 505 506 507 524 525 526 527
## [17] 528 552 553 554 555 556 641 642 643 644 645

grep(pattern = "red", x = s, value = TRUE)

##  [1] "darkred"         "indianred"       "indianred1"     
##  [4] "indianred2"      "indianred3"      "indianred4"     
##  [7] "mediumvioletred" "orangered"       "orangered1"     
## [10] "orangered2"      "orangered3"      "orangered4"     
## [13] "palevioletred"   "palevioletred1"  "palevioletred2" 
## [16] "palevioletred3"  "palevioletred4"  "red"            
## [19] "red1"            "red2"            "red3"           
## [22] "red4"            "violetred"       "violetred1"     
## [25] "violetred2"      "violetred3"      "violetred4"

# Colors that start with "red".
grep(pattern = "^red", x = s, value = TRUE)

## [1] "red"  "red1" "red2" "red3" "red4"

# Colors that contain "red" followed by a digit.
grep(pattern = "red\\d", x = s, value = TRUE)

##  [1] "indianred1"     "indianred2"     "indianred3"    
##  [4] "indianred4"     "orangered1"     "orangered2"    
##  [7] "orangered3"     "orangered4"     "palevioletred1"
## [10] "palevioletred2" "palevioletred3" "palevioletred4"
## [13] "red1"           "red2"           "red3"          
## [16] "red4"           "violetred1"     "violetred2"    
## [19] "violetred3"     "violetred4"

# grep(pattern = "red[0-9]", x = s, value = TRUE)

# Extract what is to the left of "red".
red <- grep(pattern = "red\\d", x = s, value = TRUE)
sub(pattern = "^(.*)red.*", replacement = "\\1", x = red)

##  [1] "indian"     "indian"     "indian"     "indian"    
##  [5] "orange"     "orange"     "orange"     "orange"    
##  [9] "paleviolet" "paleviolet" "paleviolet" "paleviolet"
## [13] ""           ""           ""           ""          
## [17] "violet"     "violet"     "violet"     "violet"

The `stringr` package

https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html.

library(stringr)
ls("package:stringr")

##  [1] "%>%"             "boundary"        "coll"           
##  [4] "fixed"           "fruit"           "ignore.case"    
##  [7] "invert_match"    "perl"            "regex"          
## [10] "sentences"       "str_c"           "str_conv"       
## [13] "str_count"       "str_detect"      "str_dup"        
## [16] "str_extract"     "str_extract_all" "str_interp"     
## [19] "str_join"        "str_length"      "str_locate"     
## [22] "str_locate_all"  "str_match"       "str_match_all"  
## [25] "str_order"       "str_pad"         "str_replace"    
## [28] "str_replace_all" "str_replace_na"  "str_sort"       
## [31] "str_split"       "str_split_fixed" "str_sub"        
## [34] "str_sub<-"       "str_subset"      "str_to_lower"   
## [37] "str_to_title"    "str_to_upper"    "str_trim"       
## [40] "str_trunc"       "str_view"        "str_view_all"   
## [43] "str_which"       "str_wrap"        "word"           
## [46] "words"

Practice

Baby names that end with “ana”

library(XML)  # Provides htmlParse() and xpathSApply().

url <- "http://bebeatual.com/nomes-de-menina-letra-%s"
# browseURL(sprintf(url, "A"))

# Named list to store the results, one element per letter.
nms <- vector(mode = "list", length = length(LETTERS))
names(nms) <- LETTERS

for (i in LETTERS) {
    cat("Reading letter:", i, "\n")
    u <- sprintf(url, i)
    h <- htmlParse(u, encoding = "utf-8")
    nms[[i]] <-
        xpathSApply(h,
                    path = "//div[@id = 'centra2']/div[@class]",
                    fun = xmlValue,
                    trim = TRUE)
}

# Flatten the list into a vector.
nomes <- unlist(nms, use.names = FALSE)

# Remove the odd underscore.
head(nomes)
nomes <- sub(" ", "", nomes)

# Names that end with "ana".
grep(pattern = "ana$", x = nomes, value = TRUE)

Names that start with “wal”

url <- "http://bebeatual.com/nomes-de-menino-letra-%s"

# Named list to store the results, one element per letter.
nms <- vector(mode = "list", length = length(LETTERS))
names(nms) <- LETTERS

for (i in LETTERS) {
    cat("Reading letter:", i, "\n")
    u <- sprintf(url, i)
    h <- htmlParse(u, encoding = "utf-8")
    nms[[i]] <-
        xpathSApply(h,
                    path = "//div[@id = 'centra2']/div[@class]",
                    fun = xmlValue,
                    trim = TRUE)
}

# Flatten the list into a vector.
nomes <- unlist(nms, use.names = FALSE)

# Remove the odd underscore.
head(nomes)
nomes <- sub(" ", "", nomes)

# Names that start with "wal" (case-insensitive).
grep(pattern = "^wal", x = nomes, value = TRUE, ignore.case = TRUE)

# Walmir? Waldir? Waldemir?

Cleaning addresses

url <- paste0("http://www.imovelweb.com.br/",
              "apartamentos-venda-centro-curitiba.html")
# browseURL(url)
h <- htmlParse(url)

# Addresses of the properties.
end <- xpathSApply(h,
                   path = paste0("//ul[@class = 'list-posts']/li",
                                 "//div[@class = 'post-location']"),
                   fun = xmlValue,
                   recursive = FALSE,
                   trim = TRUE)
end

##  [1] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Conselheiro Laurindo\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"            
##  [2] "\n\t\t\t\t\t\n\t\t\t\t\t\tAvenida Visconde de Guarapuava\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"      
##  [3] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Conselheiro Laurindo, \n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"          
##  [4] "\n\t\t\t\t\t\n\t\t\t\t\t\tAvenida Visconde de Guarapuava, 3806\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"
##  [5] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Amintas de Barros, 240\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"          
##  [6] "\n\t\t\t\t\t\n\t\t\t\t\t\tAv. Visconde de Guarapuava\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"          
##  [7] "\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t"                                                                 
##  [8] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Amintas de Barros, 240\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"          
##  [9] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Doutor Pedrosa\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"                  
## [10] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Lourenço Pinto\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"                  
## [11] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Benjamin Constant 316\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"           
## [12] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Nunes Machado\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"                   
## [13] "\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t"                                                                 
## [14] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Barão do Rio Branco\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"             
## [15] "\n\t\t\t\t\t\n\t\t\t\t\t\tAvenida Sete de Setembro\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"            
## [16] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Frei Eurico de Mello, 131\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"       
## [17] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Emiliano Perneta, 500\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"           
## [18] "\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t"                                                                 
## [19] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Paula Gomes, 198\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"                
## [20] "\n\t\t\t\t\t\n\t\t\t\t\t\tRua Paula Gomes,696\n\t\t\t\t\t\t-\n\t\t\t\t\t\n\t\t\t\t"

# Remove runs of spaces, tabs, line breaks and the stray hyphen.
end <- gsub("[[:space:]][[:space:]-]+", "", end)
end

##  [1] "Rua Conselheiro Laurindo"            
##  [2] "Avenida Visconde de Guarapuava"      
##  [3] "Rua Conselheiro Laurindo,"           
##  [4] "Avenida Visconde de Guarapuava, 3806"
##  [5] "Rua Amintas de Barros, 240"          
##  [6] "Av. Visconde de Guarapuava"          
##  [7] ""                                    
##  [8] "Rua Amintas de Barros, 240"          
##  [9] "Rua Doutor Pedrosa"                  
## [10] "Rua Lourenço Pinto"                  
## [11] "Rua Benjamin Constant 316"           
## [12] "Rua Nunes Machado"                   
## [13] ""                                    
## [14] "Rua Barão do Rio Branco"             
## [15] "Avenida Sete de Setembro"            
## [16] "Rua Frei Eurico de Mello, 131"       
## [17] "Rua Emiliano Perneta, 500"           
## [18] ""                                    
## [19] "Rua Paula Gomes, 198"                
## [20] "Rua Paula Gomes,696"

# Keep only the addresses that end with a number.
# grep(pattern = "\\d+$", x = end, value = TRUE)
end <- grep(pattern = "[[:digit:]]+$", x = end, value = TRUE)
end

## [1] "Avenida Visconde de Guarapuava, 3806"
## [2] "Rua Amintas de Barros, 240"          
## [3] "Rua Amintas de Barros, 240"          
## [4] "Rua Benjamin Constant 316"           
## [5] "Rua Frei Eurico de Mello, 131"       
## [6] "Rua Emiliano Perneta, 500"           
## [7] "Rua Paula Gomes, 198"                
## [8] "Rua Paula Gomes,696"

Visualizing the results

library(ggmap)

# Make the address more complete.
adr <- paste(end, "Centro, Curitiba")

# Latitude and longitude of the properties.
# ll <- geocode(location = end)

suppressMessages(library(googleVis))
# help(gvisMap, h = "html")

mark <- gvisMap(data = data.frame(end = adr, tip = end),
                locationvar = "end",
                tipvar = "tip",
                options = list(showTip = TRUE,
                               showLine = TRUE,
                               enableScrollWheel = TRUE,
                               mapType = "terrain",
                               useMapTypeControl = TRUE))
options(gvis.plot.tag = "chart")

plot(mark)

Linux tools

grep

# Shows the lines that contain "title:", searching files recursively.
grep -r 'title:'

# Restricts the search to files with the Rmd extension and numbers the lines.
grep -n -r --include='*.Rmd' 'title:'

# Hides (-h) the file name.
grep -h -r --include='*.Rmd' 'title:'
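
A self-contained sketch of a recursive search on a scratch directory, so it can be run without the course files. The directory layout and file contents are hypothetical. It also uses -l (list file names only) and -c (count matching lines), two standard grep options not shown above.

```shell
# Scratch directory with hypothetical files, just for the demonstration.
dir=$(mktemp -d)
mkdir -p "$dir/slides"
printf 'title: Regex\nauthor: WZ\n' > "$dir/slides/regex.Rmd"
printf 'title: Notes\n' > "$dir/notes.txt"

# -l lists only the names of the files that have a match. Quoting the
# glob keeps the shell from expanding it before grep sees it.
files=$(grep -rl --include='*.Rmd' 'title:' "$dir")

# -c counts the matching lines in a single file.
count=$(grep -c 'title:' "$dir/slides/regex.Rmd")

echo "$files"
echo "$count"
```

Only the Rmd file is reported, since notes.txt is filtered out by --include.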

sed

sed: Stream EDitor.
https://www.gnu.org/software/sed/manual/sed.html.
https://www.gnu.org/software/sed/manual/sed.pdf (81 pages).
Can process the text of every file in a directory.
- Replace “Walmes Zeviani” with “Prof. Walmes Zeviani” in all text files.
Very fast.

sed OPTIONS... [REGEX] [INPUT...]
      |          |       |
      |          |       ` text inputs
      |          ` regex instruction
      ` options that change the processing

sed --help

## Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
## 
##   -n, --quiet, --silent
##                  suppress automatic printing of pattern space
##   -e script, --expression=script
##                  add the script to the commands to be executed
##   -f script-file, --file=script-file
##                  add the contents of script-file to the commands to be executed
##   --follow-symlinks
##                  follow symlinks when processing in place
##   -i[SUFFIX], --in-place[=SUFFIX]
##                  edit files in place (makes backup if SUFFIX supplied)
##   -l N, --line-length=N
##                  specify the desired line-wrap length for the `l' command
##   --posix
##                  disable all GNU extensions.
##   -r, --regexp-extended
##                  use extended regular expressions in the script.
##   -s, --separate
##                  consider files as separate rather than as a single continuous
##                  long stream.
##   -u, --unbuffered
##                  load minimal amounts of data from the input files and flush
##                  the output buffers more often
##   -z, --null-data
##                  separate lines by NUL characters
##       --help     display this help and exit
##       --version  output version information and exit
## 
## If no -e, --expression, -f, or --file option is given, then the first
## non-option argument is taken as the sed script to interpret.  All
## remaining arguments are names of input files; if no input files are
## specified, then the standard input is read.
## 
## GNU sed home page: <http://www.gnu.org/software/sed/>.
## General help using GNU software: <http://www.gnu.org/gethelp/>.
## E-mail bug reports to: <bug-sed@gnu.org>.
## Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.

head toda-forma-de-amor.txt

# Show only the first line.
sed -n '1p' toda-forma-de-amor.txt

## Eu não pedi pra nascer

# Show a range of lines.
sed -n '1,5p' toda-forma-de-amor.txt

## Eu não pedi pra nascer
## Eu não nasci pra perder
## Nem vou sobrar de vítima
## Das circunstâncias

# Lines that contain the pattern.
sed -n '/a gente/p' toda-forma-de-amor.txt

## E a gente vive junto
## E a gente se dá bem
## E a gente vai à luta
## E a gente vive junto
## E a gente se dá bem
## E a gente vai à luta

# Find and replace.
sed 's/a gente/A GENTE/g' toda-forma-de-amor.txt | tail

## E só traz o que quer
## Eu sou teu homem
## Você é minha mulher
## 
## E A GENTE vive junto
## E A GENTE se dá bem
## Não desejamos mal a quase ninguém
## E A GENTE vai à luta
## E conhece a dor
## Consideramos justa toda forma de amor

# Append a comma to the end of every line except the last.
cp toda-forma-de-amor.txt teste.txt
sed -i '$!s/$/,/' teste.txt

# Insert a line with [ before the first line.
# sed -i '1s/^/[/' teste.txt
sed -i '1i [' teste.txt

# Add a line with ] after the last line.
# sed -i '$s/,$/]/' teste.txt
echo "]" >> teste.txt

head -n 3 teste.txt
tail -n 3 teste.txt

# sed '1i [' toda-forma-de-amor.txt
# sed '$a ]' toda-forma-de-amor.txt

## [
## Eu não pedi pra nascer,
## Eu não nasci pra perder,
## E conhece a dor,
## Consideramos justa toda forma de amor
## ]
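
The bulk replacement mentioned at the start of this section could look like the sketch below. The scratch directory and the file contents are hypothetical. The -i.bak form edits each file in place and keeps a backup copy with the .bak suffix.

```shell
# Scratch directory with hypothetical text files.
dir=$(mktemp -d)
printf 'Author: Walmes Zeviani\n' > "$dir/a.txt"
printf 'Reviewed by Walmes Zeviani.\nNo name here.\n' > "$dir/b.txt"

# Replace every occurrence in every .txt file of the directory.
sed -i.bak 's/Walmes Zeviani/Prof. Walmes Zeviani/g' "$dir"/*.txt

cat "$dir"/*.txt
```

Running the command a second time would produce “Prof. Prof. Walmes Zeviani”, so on real files it is worth comparing against the backups before deleting them.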

awk

awk: named after the initials of its authors, Aho, Weinberger and Kernighan.
Written in 1977 at AT&T Bell Laboratories.
GNU awk (gawk) is the most popular version.
Paul Rubin wrote gawk in 1986.
awk is both the program and the language (as with R).
awk searches and processes lines of text in files.
https://www.gnu.org/software/gawk/manual/gawk.html.
https://www.gnu.org/software/gawk/manual/gawk.pdf (540 pages).
https://www.math.utah.edu/docs/info/gawk_19.html.

awk --help

## Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
## Usage: awk [POSIX or GNU style options] [--] 'program' file ...
## POSIX options:       GNU long options: (standard)
##  -f progfile     --file=progfile
##  -F fs           --field-separator=fs
##  -v var=val      --assign=var=val
## Short options:       GNU long options: (extensions)
##  -b          --characters-as-bytes
##  -c          --traditional
##  -C          --copyright
##  -d[file]        --dump-variables[=file]
##  -D[file]        --debug[=file]
##  -e 'program-text'   --source='program-text'
##  -E file         --exec=file
##  -g          --gen-pot
##  -h          --help
##  -i includefile      --include=includefile
##  -l library      --load=library
##  -L[fatal|invalid]   --lint[=fatal|invalid]
##  -M          --bignum
##  -N          --use-lc-numeric
##  -n          --non-decimal-data
##  -o[file]        --pretty-print[=file]
##  -O          --optimize
##  -p[file]        --profile[=file]
##  -P          --posix
##  -r          --re-interval
##  -S          --sandbox
##  -t          --lint-old
##  -V          --version
## 
## To report bugs, see node `Bugs' in `gawk.info', which is
## section `Reporting Problems and Bugs' in the printed version.
## 
## gawk is a pattern scanning and processing language.
## By default it reads standard input and writes standard output.
## 
## Examples:
##  gawk '{ sum += $1 }; END { print sum }' file
##  gawk -F: '{ print $1 }' /etc/passwd

I used it to process the ENEM microdata.

http://portal.inep.gov.br/microdados.
The Enem 2015 file is a 1.1 GB ZIP.
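
A minimal awk sketch of this kind of row-by-row processing. The toy CSV, its column names and its values are made up and do not reflect the layout of the actual ENEM microdata. Only standard awk features are used (field separator, pattern, action, END block, associative array).

```shell
# Hypothetical toy file standing in for a large delimited data set.
csv=$(mktemp)
cat > "$csv" <<'EOF'
id,state,score
1,PR,612.5
2,SP,580.0
3,PR,701.5
4,SC,655.0
EOF

# -F sets the field separator. NR > 1 skips the header line and
# $2 == "PR" selects the rows; END runs once after the last line.
mean_pr=$(awk -F, 'NR > 1 && $2 == "PR" { s += $3; n++ }
                   END { if (n > 0) printf "%.1f", s / n }' "$csv")

# Number of records per state, using an associative array.
counts=$(awk -F, 'NR > 1 { k[$2]++ }
                  END { for (u in k) print u, k[u] }' "$csv" | sort)

echo "$mean_pr"
echo "$counts"
```

Because awk reads one line at a time, the same program works on files that are too large to load into memory.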

Books

Summary

Regular expressions are indispensable for text processing.
R provides regex utilities in the base package.
The stringr package provides wrappers for text processing.
On Linux, sed and awk are useful for working with batches of files and with very large files.

Next class

There will be no class on Tuesday.
Reason: Escola de Modelos de Regressão (Regression Models School).

