Prof. Walmes Marques Zeviani
07 Mar 2017
<html> is the root element, which contains the elements:
<head>: works as a preamble; it holds metadata and loads files.
<body>: holds the content to be displayed by the browser.
Comments are written as <!-- A comment -->.
Character entities start with & and end with ;. For example, &lt; is < and &gt; is >.
<p> is the tag for paragraphs.
<b>, <i>, <strong> and <em> are tags for formatting text in bold and italics.
<a href="http://..."> comes from "anchor" and is used to create hyperlinks, either within the page or to other pages.
<ol> and <ul> are tags that mark ordered (numbered) and unordered (bulleted) lists.
<table>, <tr>, <th> and <td> are tags for the elements of a table: the table itself, its rows, its header cells and its body cells.
<img src="/endereço/da/imagem.png"> is used to display images.
<div> and <span> are organizational elements that group other elements to make it easier to apply CSS formatting.
<meta name="description" content="Description"> is used to store page metadata.
<script> holds code that enables functionality in the HTML page, usually JavaScript.
<link> is used to include external files, usually CSS.
htmlParse() is built to cope with the looser syntax rules of HTML.
library(XML)
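url <- "http://leg.ufpr.br/~walmes/ensino/"
# Download the page and parse it into a document tree.
h <- htmlParse(url)
# Count the nodes in the document by tag name.
summary(h)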
## $nameCounts
##
##      td       a      tr     img      th      hr address    body
##     210      46      45      43       7       2       1       1
##      h1    head    html   table   title
##       1       1       1       1       1
##
## $numNodes
## [1] 360
h
## <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
## <html>
## <head><title>Index of /~walmes/ensino</title></head>
## <body>
## <h1>Index of /~walmes/ensino</h1>
## <table>
## <tr>
## <th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th>
## <th><a href="?C=N;O=D">Name</a></th>
## <th><a href="?C=M;O=A">Last modified</a></th>
## <th><a href="?C=S;O=A">Size</a></th>
## <th><a href="?C=D;O=A">Description</a></th>
## </tr>
## <tr><th colspan="5"><hr></th></tr>
## <tr>
## <td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td>
## <td><a href="/~walmes/">Parent Directory</a></td>
## <td> </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="CEQ/">CEQ/</a></td>
## <td align="right">2019-03-27 18:50 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="CPI/">CPI/</a></td>
## <td align="right">2018-12-18 12:13 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="EC2/">EC2/</a></td>
## <td align="right">2018-12-14 17:42 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ML/">ML/</a></td>
## <td align="right">2018-12-07 10:43 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/layout.gif" alt="[ ]"></td>
## <td><a href="Modelos-lineares-Joao-Gil-de-Luna.pdf">Modelos-lineares-Joao-Gil-de-Luna.pdf</a></td>
## <td align="right">2016-08-14 21:40 </td>
## <td align="right">535K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="af722-2014-01/">af722-2014-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="apv7037/">apv7037/</a></td>
## <td align="right">2016-11-25 13:38 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce001mb-2014-01/">ce001mb-2014-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce001mb-2015-01/">ce001mb-2015-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce002b-2011-02/">ce002b-2011-02/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce002b-2012-01/">ce002b-2012-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce003-n2n3-2010-02/">ce003-n2n3-2010-02/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce003b-2011-01/">ce003b-2011-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce063-2015-01/">ce063-2015-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce064-2015-02/">ce064-2015-02/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce071-2014-01/">ce071-2014-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce074-2012-02/">ce074-2012-02/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2011-02/">ce083-2011-02/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2012-01/">ce083-2012-01/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2012-02/">ce083-2012-02/</a></td>
## <td align="right">2016-08-10 10:03 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2013-01/">ce083-2013-01/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2013-02/">ce083-2013-02/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2014-01/">ce083-2014-01/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce083-2014-02/">ce083-2014-02/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce089-2014-02/">ce089-2014-02/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce089-2015-02/">ce089-2015-02/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce213-2015-01/">ce213-2015-01/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="ce223-2011-01/">ce223-2011-01/</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="dsbd/">dsbd/</a></td>
## <td align="right">2018-06-30 13:29 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td>
## <td><a href="e213-2015-01">e213-2015-01</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right">796K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="extensoes/">extensoes/</a></td>
## <td align="right">2017-10-30 18:57 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="leg-mapa.png">leg-mapa.png</a></td>
## <td align="right">2016-08-10 10:35 </td>
## <td align="right">1.4M</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="leg_mapa.jpg">leg_mapa.jpg</a></td>
## <td align="right">2016-10-13 09:31 </td>
## <td align="right">193K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="legmapa.png">legmapa.png</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right">5.2M</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="mintex/">mintex/</a></td>
## <td align="right">2019-03-26 20:42 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="passos_datafilehost.png">passos_datafilehost.png</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> 85K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/image2.gif" alt="[IMG]"></td>
## <td><a href="passos_discussao.png">passos_discussao.png</a></td>
## <td align="right">2016-08-10 10:04 </td>
## <td align="right"> 45K</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="pesq-reprod/">pesq-reprod/</a></td>
## <td align="right">2018-07-05 18:44 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="teste/">teste/</a></td>
## <td align="right">2018-04-19 18:33 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td>
## <td><a href="visualizacao-dados.html">visualizacao-dados.html</a></td>
## <td align="right">2018-04-11 15:17 </td>
## <td align="right"> 18M</td>
## <td> </td>
## </tr>
## <tr>
## <td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td>
## <td><a href="web-scraping/">web-scraping/</a></td>
## <td align="right">2018-02-26 00:10 </td>
## <td align="right"> - </td>
## <td> </td>
## </tr>
## <tr><th colspan="5"><hr></th></tr>
## </table>
## <address>Apache/2.4.10 (Debian) Server at leg.ufpr.br Port 80</address>
## </body>
## </html>
##
# Extract the tables: <table>.
tb <- readHTMLTable(h)
tb
## $`NULL`
## Name Last modified Size
## 1 <NA> <NA> <NA>
## 2 Parent Directory -
## 3 CEQ/ 2019-03-27 18:50 -
## 4 CPI/ 2018-12-18 12:13 -
## 5 EC2/ 2018-12-14 17:42 -
## 6 ML/ 2018-12-07 10:43 -
## 7 Modelos-lineares-Joao-Gil-de-Luna.pdf 2016-08-14 21:40 535K
## 8 af722-2014-01/ 2016-08-10 10:03 -
## 9 apv7037/ 2016-11-25 13:38 -
## 10 ce001mb-2014-01/ 2016-08-10 10:03 -
## 11 ce001mb-2015-01/ 2016-08-10 10:03 -
## 12 ce002b-2011-02/ 2016-08-10 10:03 -
## 13 ce002b-2012-01/ 2016-08-10 10:03 -
## 14 ce003-n2n3-2010-02/ 2016-08-10 10:03 -
## 15 ce003b-2011-01/ 2016-08-10 10:03 -
## 16 ce063-2015-01/ 2016-08-10 10:03 -
## 17 ce064-2015-02/ 2016-08-10 10:03 -
## 18 ce071-2014-01/ 2016-08-10 10:03 -
## 19 ce074-2012-02/ 2016-08-10 10:03 -
## 20 ce083-2011-02/ 2016-08-10 10:03 -
## 21 ce083-2012-01/ 2016-08-10 10:03 -
## 22 ce083-2012-02/ 2016-08-10 10:03 -
## 23 ce083-2013-01/ 2016-08-10 10:04 -
## 24 ce083-2013-02/ 2016-08-10 10:04 -
## 25 ce083-2014-01/ 2016-08-10 10:04 -
## 26 ce083-2014-02/ 2016-08-10 10:04 -
## 27 ce089-2014-02/ 2016-08-10 10:04 -
## 28 ce089-2015-02/ 2016-08-10 10:04 -
## 29 ce213-2015-01/ 2016-08-10 10:04 -
## 30 ce223-2011-01/ 2016-08-10 10:04 -
## 31 dsbd/ 2018-06-30 13:29 -
## 32 e213-2015-01 2016-08-10 10:04 796K
## 33 extensoes/ 2017-10-30 18:57 -
## 34 leg-mapa.png 2016-08-10 10:35 1.4M
## 35 leg_mapa.jpg 2016-10-13 09:31 193K
## 36 legmapa.png 2016-08-10 10:04 5.2M
## 37 mintex/ 2019-03-26 20:42 -
## 38 passos_datafilehost.png 2016-08-10 10:04 85K
## 39 passos_discussao.png 2016-08-10 10:04 45K
## 40 pesq-reprod/ 2018-07-05 18:44 -
## 41 teste/ 2018-04-19 18:33 -
## 42 visualizacao-dados.html 2018-04-11 15:17 18M
## 43 web-scraping/ 2018-02-26 00:10 -
## 44 <NA> <NA> <NA>
## Description
## 1 <NA>
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31
## 32
## 33
## 34
## 35
## 36
## 37
## 38
## 39
## 40
## 41
## 42
## 43
## 44 <NA>
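The extracted table usually needs some cleaning before it is useful. The sketch below is only an illustration, not part of the original notes: it uses a small hand-built data frame (the hypothetical `dir_tb`, with a few values copied from the output above) in place of `tb[[1]]`, and it assumes that blank cells arrive as empty strings. It drops the rows that are entirely `NA` (which appear to come from the `<hr>` separator rows), converts the timestamps and flags the directories.

```r
# Hand-built stand-in for tb[[1]] (hypothetical object; values copied
# from the output above, blank cells assumed to be empty strings).
dir_tb <- data.frame(
    Name = c(NA, "Parent Directory", "CEQ/", "leg-mapa.png", NA),
    `Last modified` = c(NA, "", "2019-03-27 18:50", "2016-08-10 10:35", NA),
    Size = c(NA, "-", "-", "1.4M", NA),
    check.names = FALSE,
    stringsAsFactors = FALSE
)

# Keep only the rows that have at least one non-missing value.
keep <- rowSums(!is.na(dir_tb)) > 0
dir_tb <- dir_tb[keep, ]

# Convert the timestamp text to date-time; entries that do not match
# the format (such as the blank one) become NA.
dir_tb$modified <- as.POSIXct(dir_tb[["Last modified"]],
                              format = "%Y-%m-%d %H:%M", tz = "UTC")

# Flag directories by the trailing slash in the name.
dir_tb$is_dir <- grepl("/$", dir_tb$Name)
dir_tb
```

With the real object, the same steps would start from `dir_tb <- tb[[1]]`, possibly after converting factor columns to character.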
# Extract all hyperlinks: <a href>.
lk <- getHTMLLinks(h)
lk
## [1] "?C=N;O=D"
## [2] "?C=M;O=A"
## [3] "?C=S;O=A"
## [4] "?C=D;O=A"
## [5] "/~walmes/"
## [6] "CEQ/"
## [7] "CPI/"
## [8] "EC2/"
## [9] "ML/"
## [10] "Modelos-lineares-Joao-Gil-de-Luna.pdf"
## [11] "af722-2014-01/"
## [12] "apv7037/"
## [13] "ce001mb-2014-01/"
## [14] "ce001mb-2015-01/"
## [15] "ce002b-2011-02/"
## [16] "ce002b-2012-01/"
## [17] "ce003-n2n3-2010-02/"
## [18] "ce003b-2011-01/"
## [19] "ce063-2015-01/"
## [20] "ce064-2015-02/"
## [21] "ce071-2014-01/"
## [22] "ce074-2012-02/"
## [23] "ce083-2011-02/"
## [24] "ce083-2012-01/"
## [25] "ce083-2012-02/"
## [26] "ce083-2013-01/"
## [27] "ce083-2013-02/"
## [28] "ce083-2014-01/"
## [29] "ce083-2014-02/"
## [30] "ce089-2014-02/"
## [31] "ce089-2015-02/"
## [32] "ce213-2015-01/"
## [33] "ce223-2011-01/"
## [34] "dsbd/"
## [35] "e213-2015-01"
## [36] "extensoes/"
## [37] "leg-mapa.png"
## [38] "leg_mapa.jpg"
## [39] "legmapa.png"
## [40] "mintex/"
## [41] "passos_datafilehost.png"
## [42] "passos_discussao.png"
## [43] "pesq-reprod/"
## [44] "teste/"
## [45] "visualizacao-dados.html"
## [46] "web-scraping/"
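The links returned are relative, and the first ones are only the column-sorting links and the parent directory. A minimal sketch of how they could be filtered and turned into absolute URLs is shown below. It is an illustration under assumptions: `lk_sample` is a hypothetical short subset of the vector above, and the names `rel`, `dirs`, `files` and `file_urls` are not from the original notes.

```r
# Base address of the listing (same URL used above).
base_url <- "http://leg.ufpr.br/~walmes/ensino/"

# Hypothetical subset of the links printed above.
lk_sample <- c("?C=N;O=D", "/~walmes/", "CEQ/",
               "Modelos-lineares-Joao-Gil-de-Luna.pdf", "leg-mapa.png")

# Drop the sorting links (start with "?") and the parent directory
# (starts with "/"), keeping only entries of this directory.
rel <- lk_sample[!grepl("^[?/]", lk_sample)]

# Entries ending in "/" are subdirectories; the others are files.
dirs <- rel[grepl("/$", rel)]
files <- rel[!grepl("/$", rel)]

# Build absolute URLs for the files.
file_urls <- paste0(base_url, files)
file_urls
```

The resulting addresses could then be passed to `download.file()`, one at a time, if the files are needed locally.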
An HTML page has two main parts: the <head> and the <body>.
The <body> holds the content to be displayed.
The <head> holds the settings that control how the page works and its metadata.
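To make the split between <head> and <body> concrete, the sketch below parses a tiny page written as a string and pulls pieces out of each part with XPath. The page content and the object names (`page`, `doc`, `page_title`, `para`, `href`) are made up for illustration; only functions from the XML package are used (`htmlParse()`, `xpathSApply()`, `xmlValue()`, `xmlGetAttr()`).

```r
library(XML)

# A tiny, made-up page: metadata in <head>, visible content in <body>.
page <- '<html>
<head><title>Example page</title></head>
<body>
<!-- A comment -->
<p>The tag &lt;p&gt; marks a <b>paragraph</b>.</p>
<a href="http://example.org/">a link</a>
</body>
</html>'

# asText = TRUE tells htmlParse() that the input is HTML text,
# not a file name or URL.
doc <- htmlParse(page, asText = TRUE)

# Title stored in the <head>.
page_title <- xpathSApply(doc, "//head/title", xmlValue)

# Text of the paragraph in the <body>; entities such as &lt; are
# decoded by the parser.
para <- xpathSApply(doc, "//body/p", xmlValue)

# Value of the href attribute of every <a> element.
href <- xpathSApply(doc, "//a", xmlGetAttr, "href")

page_title
para
href
```

The same XPath expressions can be applied to the object `h` parsed from the directory listing.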