Introduction to HTML Documents

Prof. Walmes Marques Zeviani

07 Mar 2017

Objective and rationale

HTML

General aspects

HTML structure

An HTML page

Most commonly used tags

Reading a page with R

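# Load the XML package and parse the page into a document tree.
library(XML)
url <- "http://leg.ufpr.br/~walmes/ensino/"
h <- htmlParse(url)
# Count the nodes in the document by tag name.
summary(h)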
## $nameCounts
## 
##      td       a      tr     img      th      hr address    body 
##     210      46      45      43       7       2       1       1 
##      h1    head    html   table   title 
##       1       1       1       1       1 
## 
## $numNodes
## [1] 360
h
## <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
## <html>
## <head><title>Index of /~walmes/ensino</title></head>
## <body>
## <h1>Index of /~walmes/ensino</h1>
##   <table>
## <tr>
## <th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th>
## <th><a href="?C=N;O=D">Name</a></th>
## <th><a href="?C=M;O=A">Last modified</a></th>
## <th><a href="?C=S;O=A">Size</a></th>
## <th><a href="?C=D;O=A">Description</a></th>
## </tr>
## <tr><th colspan="5"><hr></th></tr>
## <tr>
## <td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td>
## <td><a href="/~walmes/">Parent Directory</a></td>
## <td> </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="CEQ/">CEQ/</a></td>
## <td align="right">2019-03-27 18:50  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="CPI/">CPI/</a></td>
## <td align="right">2018-12-18 12:13  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="EC2/">EC2/</a></td>
## <td align="right">2018-12-14 17:42  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ML/">ML/</a></td>
## <td align="right">2018-12-07 10:43  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/layout.gif" alt="[   ]"></td>
## <td><a href="Modelos-lineares-Joao-Gil-de-Luna.pdf">Modelos-lineares-Joao-Gil-de-Luna.pdf</a></td>
## <td align="right">2016-08-14 21:40  </td>
## <td align="right">535K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="af722-2014-01/">af722-2014-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="apv7037/">apv7037/</a></td>
## <td align="right">2016-11-25 13:38  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce001mb-2014-01/">ce001mb-2014-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce001mb-2015-01/">ce001mb-2015-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce002b-2011-02/">ce002b-2011-02/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce002b-2012-01/">ce002b-2012-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce003-n2n3-2010-02/">ce003-n2n3-2010-02/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce003b-2011-01/">ce003b-2011-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce063-2015-01/">ce063-2015-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce064-2015-02/">ce064-2015-02/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce071-2014-01/">ce071-2014-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce074-2012-02/">ce074-2012-02/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2011-02/">ce083-2011-02/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2012-01/">ce083-2012-01/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2012-02/">ce083-2012-02/</a></td>
## <td align="right">2016-08-10 10:03  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2013-01/">ce083-2013-01/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2013-02/">ce083-2013-02/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2014-01/">ce083-2014-01/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2014-02/">ce083-2014-02/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce089-2014-02/">ce089-2014-02/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce089-2015-02/">ce089-2015-02/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce213-2015-01/">ce213-2015-01/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce223-2011-01/">ce223-2011-01/</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="dsbd/">dsbd/</a></td>
## <td align="right">2018-06-30 13:29  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/unknown.gif" alt="[   ]"></td>
## <td><a href="e213-2015-01">e213-2015-01</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">796K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="extensoes/">extensoes/</a></td>
## <td align="right">2017-10-30 18:57  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="leg-mapa.png">leg-mapa.png</a></td>
## <td align="right">2016-08-10 10:35  </td>
## <td align="right">1.4M</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="leg_mapa.jpg">leg_mapa.jpg</a></td>
## <td align="right">2016-10-13 09:31  </td>
## <td align="right">193K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="legmapa.png">legmapa.png</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right">5.2M</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="mintex/">mintex/</a></td>
## <td align="right">2019-03-26 20:42  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="passos_datafilehost.png">passos_datafilehost.png</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right"> 85K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="passos_discussao.png">passos_discussao.png</a></td>
## <td align="right">2016-08-10 10:04  </td>
## <td align="right"> 45K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="pesq-reprod/">pesq-reprod/</a></td>
## <td align="right">2018-07-05 18:44  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="teste/">teste/</a></td>
## <td align="right">2018-04-19 18:33  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td>
## <td><a href="visualizacao-dados.html">visualizacao-dados.html</a></td>
## <td align="right">2018-04-11 15:17  </td>
## <td align="right"> 18M</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="web-scraping/">web-scraping/</a></td>
## <td align="right">2018-02-26 00:10  </td>
## <td align="right">  - </td>
## <td> </td>
## </tr>
## <tr><th colspan="5"><hr></th></tr>
## </table>
## <address>Apache/2.4.10 (Debian) Server at leg.ufpr.br Port 80</address>
## </body>
## </html>
## 
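Individual nodes of a parsed document can also be queried with XPath expressions. The sketch below is only an illustration: it parses a hypothetical miniature of the listing above from a string (the `page` text and the `/demo` title are made up for this example), so it runs without network access.

```r
library(XML)

# Hypothetical miniature of the directory listing, parsed from a string
# so that the example does not depend on network access.
page <- '<html><head><title>Index of /demo</title></head>
<body><h1>Index of /demo</h1>
<table>
<tr><th>Name</th><th>Size</th></tr>
<tr><td><a href="CEQ/">CEQ/</a></td><td>-</td></tr>
<tr><td><a href="leg-mapa.png">leg-mapa.png</a></td><td>1.4M</td></tr>
</table></body></html>'
doc <- htmlParse(page, asText = TRUE)

# Text content of the <title> node.
title <- xpathSApply(doc, "//title", xmlValue)

# Value of the href attribute of every <a> node.
hrefs <- xpathSApply(doc, "//a", xmlGetAttr, "href")

# Number of table rows (<tr>), header row included.
n_rows <- length(getNodeSet(doc, "//tr"))
```

The same expressions should work on the object `h` parsed from the real page, since both are documents returned by `htmlParse()`.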
# Extract the tables: <table>.
tb <- readHTMLTable(h)
tb
## $`NULL`
##                                      Name    Last modified Size
## 1                                    <NA>             <NA> <NA>
## 2                        Parent Directory                     -
## 3                                    CEQ/ 2019-03-27 18:50    -
## 4                                    CPI/ 2018-12-18 12:13    -
## 5                                    EC2/ 2018-12-14 17:42    -
## 6                                     ML/ 2018-12-07 10:43    -
## 7   Modelos-lineares-Joao-Gil-de-Luna.pdf 2016-08-14 21:40 535K
## 8                          af722-2014-01/ 2016-08-10 10:03    -
## 9                                apv7037/ 2016-11-25 13:38    -
## 10                       ce001mb-2014-01/ 2016-08-10 10:03    -
## 11                       ce001mb-2015-01/ 2016-08-10 10:03    -
## 12                        ce002b-2011-02/ 2016-08-10 10:03    -
## 13                        ce002b-2012-01/ 2016-08-10 10:03    -
## 14                    ce003-n2n3-2010-02/ 2016-08-10 10:03    -
## 15                        ce003b-2011-01/ 2016-08-10 10:03    -
## 16                         ce063-2015-01/ 2016-08-10 10:03    -
## 17                         ce064-2015-02/ 2016-08-10 10:03    -
## 18                         ce071-2014-01/ 2016-08-10 10:03    -
## 19                         ce074-2012-02/ 2016-08-10 10:03    -
## 20                         ce083-2011-02/ 2016-08-10 10:03    -
## 21                         ce083-2012-01/ 2016-08-10 10:03    -
## 22                         ce083-2012-02/ 2016-08-10 10:03    -
## 23                         ce083-2013-01/ 2016-08-10 10:04    -
## 24                         ce083-2013-02/ 2016-08-10 10:04    -
## 25                         ce083-2014-01/ 2016-08-10 10:04    -
## 26                         ce083-2014-02/ 2016-08-10 10:04    -
## 27                         ce089-2014-02/ 2016-08-10 10:04    -
## 28                         ce089-2015-02/ 2016-08-10 10:04    -
## 29                         ce213-2015-01/ 2016-08-10 10:04    -
## 30                         ce223-2011-01/ 2016-08-10 10:04    -
## 31                                  dsbd/ 2018-06-30 13:29    -
## 32                           e213-2015-01 2016-08-10 10:04 796K
## 33                             extensoes/ 2017-10-30 18:57    -
## 34                           leg-mapa.png 2016-08-10 10:35 1.4M
## 35                           leg_mapa.jpg 2016-10-13 09:31 193K
## 36                            legmapa.png 2016-08-10 10:04 5.2M
## 37                                mintex/ 2019-03-26 20:42    -
## 38                passos_datafilehost.png 2016-08-10 10:04  85K
## 39                   passos_discussao.png 2016-08-10 10:04  45K
## 40                           pesq-reprod/ 2018-07-05 18:44    -
## 41                                 teste/ 2018-04-19 18:33    -
## 42                visualizacao-dados.html 2018-04-11 15:17  18M
## 43                          web-scraping/ 2018-02-26 00:10    -
## 44                                   <NA>             <NA> <NA>
##    Description
## 1         <NA>
## 2             
## 3             
## 4             
## 5             
## 6             
## 7             
## 8             
## 9             
## 10            
## 11            
## 12            
## 13            
## 14            
## 15            
## 16            
## 17            
## 18            
## 19            
## 20            
## 21            
## 22            
## 23            
## 24            
## 25            
## 26            
## 27            
## 28            
## 29            
## 30            
## 31            
## 32            
## 33            
## 34            
## 35            
## 36            
## 37            
## 38            
## 39            
## 40            
## 41            
## 42            
## 43            
## 44        <NA>
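The table printed above still needs some cleaning before it is useful: the first and last rows are all `NA` (they appear to correspond to the `<hr>` separator rows), and the timestamps are plain text. A minimal sketch of that cleaning follows. It assumes a hand-built stand-in, `tb1`, that copies a few of the printed rows, so it can be run without downloading the page; the names `tb1`, `files`, `modified` and `is_dir` are illustrative only.

```r
# Hand-built stand-in for tb[[1]], copying a few of the rows printed above.
tb1 <- data.frame(
    Name = c(NA, "Parent Directory", "CEQ/", "leg-mapa.png", NA),
    `Last modified` = c(NA, "", "2019-03-27 18:50", "2016-08-10 10:35", NA),
    Size = c(NA, "-", "-", "1.4M", NA),
    check.names = FALSE,
    stringsAsFactors = FALSE
)

# Drop the all-NA rows and the parent directory entry.
keep <- !is.na(tb1$Name) & tb1$Name != "Parent Directory"
files <- tb1[keep, ]

# Convert the timestamp text into date-time values.
files$modified <- as.POSIXct(trimws(files[["Last modified"]]),
                             format = "%Y-%m-%d %H:%M", tz = "UTC")

# Entries whose name ends with "/" are directories.
files$is_dir <- grepl("/$", files$Name)
```

With the real data, `tb[[1]]` would take the place of `tb1`. Passing `stringsAsFactors = FALSE` to `readHTMLTable()` keeps the columns as character vectors, which makes this kind of text processing simpler.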
# Extract all hyperlinks: <a href>.
lk <- getHTMLLinks(h)
lk
##  [1] "?C=N;O=D"                             
##  [2] "?C=M;O=A"                             
##  [3] "?C=S;O=A"                             
##  [4] "?C=D;O=A"                             
##  [5] "/~walmes/"                            
##  [6] "CEQ/"                                 
##  [7] "CPI/"                                 
##  [8] "EC2/"                                 
##  [9] "ML/"                                  
## [10] "Modelos-lineares-Joao-Gil-de-Luna.pdf"
## [11] "af722-2014-01/"                       
## [12] "apv7037/"                             
## [13] "ce001mb-2014-01/"                     
## [14] "ce001mb-2015-01/"                     
## [15] "ce002b-2011-02/"                      
## [16] "ce002b-2012-01/"                      
## [17] "ce003-n2n3-2010-02/"                  
## [18] "ce003b-2011-01/"                      
## [19] "ce063-2015-01/"                       
## [20] "ce064-2015-02/"                       
## [21] "ce071-2014-01/"                       
## [22] "ce074-2012-02/"                       
## [23] "ce083-2011-02/"                       
## [24] "ce083-2012-01/"                       
## [25] "ce083-2012-02/"                       
## [26] "ce083-2013-01/"                       
## [27] "ce083-2013-02/"                       
## [28] "ce083-2014-01/"                       
## [29] "ce083-2014-02/"                       
## [30] "ce089-2014-02/"                       
## [31] "ce089-2015-02/"                       
## [32] "ce213-2015-01/"                       
## [33] "ce223-2011-01/"                       
## [34] "dsbd/"                                
## [35] "e213-2015-01"                         
## [36] "extensoes/"                           
## [37] "leg-mapa.png"                         
## [38] "leg_mapa.jpg"                         
## [39] "legmapa.png"                          
## [40] "mintex/"                              
## [41] "passos_datafilehost.png"              
## [42] "passos_discussao.png"                 
## [43] "pesq-reprod/"                         
## [44] "teste/"                               
## [45] "visualizacao-dados.html"              
## [46] "web-scraping/"
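The links returned are relative to the page, and the first few are not content at all: they are the column-sorting queries and the link to the parent directory. One possible way to separate files from directories and to build absolute URLs is sketched below, using only base R. The vector `lk` here is a short hand-copied subset of the output above, and `base_url`, `entries`, `dirs`, `files` and `dir_urls` are names chosen for this illustration.

```r
# A few of the links printed above, copied by hand.
lk <- c("?C=N;O=D", "/~walmes/", "CEQ/", "leg-mapa.png",
        "visualizacao-dados.html", "web-scraping/")
base_url <- "http://leg.ufpr.br/~walmes/ensino/"

# Discard the column-sorting queries ("?...") and the parent link ("/...").
entries <- lk[!grepl("^[?/]", lk)]

# Subdirectories end with a slash; everything else is treated as a file.
dirs <- entries[grepl("/$", entries)]
files <- entries[!grepl("/$", entries)]

# Relative links resolve against the URL of the listing itself.
dir_urls <- paste0(base_url, dirs)
```

Each element of `dir_urls` could then be passed to `htmlParse()` to read the listing of a subdirectory in the same way as the top-level page.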

Summary

Next week

References