Prof. Walmes Marques Zeviani
14 Mar 2017
A predicate is everything that is stated about the subject (…) (http://brasilescola.uol.com.br/gramatica/tipos-predicado.htm).
An attribute or characteristic property of a thing (https://www.dicio.com.br/predicado/).
//div[<predicate>]
//ul/li[2]: selects the second <li> element of each unordered list <ul>.
library(XML)
s <- '
<ul>
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>'
h <- xmlParse(s)
xpathApply(h, "//ul/li[2]")
## [[1]]
## <li>Tea</li>
##
## attr(,"class")
## [1] "XMLNodeSet"
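When only the text of the matched node is needed, the node set can be reduced to a character vector. The sketch below rebuilds the same list under the hypothetical names `s_list` and `h_list` (so that `s` and `h` stay untouched) and uses `xpathSApply()` with `xmlValue()` from the XML package.

```r
library(XML)

# Same three-item list as above; s_list and h_list are illustrative names.
s_list <- "<ul><li>Coffee</li><li>Tea</li><li>Milk</li></ul>"
h_list <- xmlParse(s_list, asText = TRUE)

# xmlValue() extracts the text content of each node matched by the path.
second <- xpathSApply(h_list, "//ul/li[2]", xmlValue)
print(second)
```

The positional predicate is evaluated within each parent `<ul>`, so a document with several lists would return one item per list.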
//tr[./th]: selects the table rows <tr> (table row) that have header cells <th> (table header) as child elements.
s <- '
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr class="odd">
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr class="even">
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>'
h <- xmlParse(s)
xpathApply(h, "//tr[./th]")
## [[1]]
## <tr>
## <th>Firstname</th>
## <th>Lastname</th>
## <th>Age</th>
## </tr>
##
## attr(,"class")
## [1] "XMLNodeSet"
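Inside a predicate, a relative path starts at the node being tested, so `./th` and `th` mean the same thing. A minimal sketch on a reduced table, assuming the hypothetical names `s_tab` and `h_tab`:

```r
library(XML)

# Reduced version of the table above; s_tab and h_tab are illustrative names.
s_tab <- "<table>
<tr><th>Firstname</th><th>Age</th></tr>
<tr><td>Jill</td><td>50</td></tr>
</table>"
h_tab <- xmlParse(s_tab, asText = TRUE)

# Both predicates test for a <th> child of the current <tr>.
with_dot <- xpathSApply(h_tab, "//tr[./th]/th", xmlValue)
without_dot <- xpathSApply(h_tab, "//tr[th]/th", xmlValue)

# Only the header row satisfies the predicate.
n_rows <- length(getNodeSet(h_tab, "//tr[th]"))
print(with_dot)
print(n_rows)
```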
//tr[@class]: selects the table rows that have a class attribute.
xpathApply(h, "//tr[@class]")
## [[1]]
## <tr class="odd">
## <td>Jill</td>
## <td>Smith</td>
## <td>50</td>
## </tr>
##
## [[2]]
## <tr class="even">
## <td>Eve</td>
## <td>Jackson</td>
## <td>94</td>
## </tr>
##
## attr(,"class")
## [1] "XMLNodeSet"
//tr[@class='even']: selects the table rows whose class attribute is equal to even (even-numbered row).
xpathApply(h, "//tr[@class='even']")
## [[1]]
## <tr class="even">
## <td>Eve</td>
## <td>Jackson</td>
## <td>94</td>
## </tr>
##
## attr(,"class")
## [1] "XMLNodeSet"
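Attribute predicates filter nodes. To read the attribute values themselves, `xmlGetAttr()` can be applied to the matches. The sketch below uses a reduced table under the hypothetical names `s_rows` and `h_rows`.

```r
library(XML)

# Reduced version of the table above; s_rows and h_rows are illustrative names.
s_rows <- '<table>
<tr><th>Firstname</th></tr>
<tr class="odd"><td>Jill</td></tr>
<tr class="even"><td>Eve</td></tr>
</table>'
h_rows <- xmlParse(s_rows, asText = TRUE)

# Value of the class attribute for every row that has one.
classes <- xpathSApply(h_rows, "//tr[@class]", xmlGetAttr, "class")

# Text of the cells in the row whose class is exactly "even".
even_cells <- xpathSApply(h_rows, "//tr[@class='even']/td", xmlValue)
print(classes)
print(even_cells)
```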
s <- '
<div>
<ul type="bolo" class="ingred">
<li>3 eggs</li>
<li>150 ml of milk</li>
<li>3 spoonfuls of butter</li>
<li>Prepared cake mix</li>
</ul>
<ul class="preparo">
<li>Preheat the oven for 20 min at 180 degrees</li>
<li>Beat the ingredients until smooth</li>
<li>Grease the pan</li>
<li>Pour into the pan and bake for approximately 1 hour</li>
</ul>
</div>'
h <- xmlParse(s)
xpathApply(h, path = "//ul[2]")
xpathApply(h, path = "//ul/li[2]")
xpathApply(h, path = "//ul/li[position() = 2]")
xpathApply(h, path = "//ul/li[position() = last()]")
xpathApply(h, path = "//ul/li[last()]")
xpathApply(h, path = "//ul/li[position() != last()]")
xpathApply(h, path = "//ul/li[position() < last()]")
xpathApply(h, path = "//ul[count(li) >= 3]")
xpathApply(h, path = "//ul[@type]")
xpathApply(h, path = "//ul[not(@type)]")
xpathApply(h, path = "//ul[@class = 'preparo']")
xpathApply(h, path = "//ul[@class != 'preparo']")
xpathApply(h, path = "//ul[@type and @class]")
xpathApply(h, path = "//ul[@class = 'ingred' or @class = 'preparo']")
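To make the effect of the positional and attribute predicates easier to see, the sketch below runs a few of them on a shortened document whose two lists have different lengths. The names `s_rec` and `h_rec` and the single-letter items are illustrative assumptions, not part of the original example.

```r
library(XML)

# Two lists of different lengths; contents are placeholders.
s_rec <- '<div>
<ul type="bolo" class="ingred"><li>a</li><li>b</li><li>c</li></ul>
<ul class="preparo"><li>x</li><li>y</li></ul>
</div>'
h_rec <- xmlParse(s_rec, asText = TRUE)

# Position is counted within each parent <ul>.
second_items <- xpathSApply(h_rec, "//ul/li[2]", xmlValue)
last_items <- xpathSApply(h_rec, "//ul/li[last()]", xmlValue)
not_last <- xpathSApply(h_rec, "//ul/li[position() < last()]", xmlValue)

# Predicates on the <ul> itself: child count and attribute absence.
big_lists <- xpathSApply(h_rec, "//ul[count(li) >= 3]", xmlGetAttr, "class")
untyped <- xpathSApply(h_rec, "//ul[not(@type)]", xmlGetAttr, "class")

print(second_items)
print(last_items)
print(not_last)
print(big_lists)
print(untyped)
```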
# Equivalent: child:: is the default axis.
xpathApply(h, path = "//ul/li")
xpathApply(h, path = "//ul/child::li")
# Equivalent: both return the <ul> elements that have an <li> child.
xpathApply(h, path = "//li/parent::ul")
xpathApply(h, path = "//ul[li]")
# Equivalent.
xpathApply(h, path = "//div//li")
xpathApply(h, path = "//div/descendant::li")
xpathApply(h, path = "//div/descendant-or-self::li")
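# Equivalent in this document, because every <ul> is a direct child of <div>;
# with nested lists, descendant:: would also return the deeper <ul> elements.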
xpathApply(h, path = "//div/ul")
xpathApply(h, path = "//div/child::ul")
xpathApply(h, path = "//div/descendant::ul")
xpathApply(h, path = "//div/descendant-or-self::ul")
# Not equivalent: the first returns the <div>, the second returns the <li> elements.
xpathApply(h, path = "//li/ancestor::div")
xpathApply(h, path = "//div//li")
# Equivalent.
xpathApply(h, path = "//ul/following-sibling::ul")
xpathApply(h, path = "//ul[1]/following-sibling::ul")
# Equivalent here, because the document has only two sibling <ul> elements.
xpathApply(h, path = "//ul/preceding-sibling::ul")
xpathApply(h, path = "//ul[2]/preceding-sibling::ul")
# Not equivalent: the first drops the first <li>, the second drops the last <li>.
xpathApply(h, path = "//ul[1]/li[1]/following-sibling::li")
xpathApply(h, path = "//ul[1]/li[last()]/preceding-sibling::li")
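Whether two axis expressions return the same nodes can depend on the shape of the document. The sketch below uses a hypothetical nested list (names `s_nest` and `h_nest`, with `id` attributes added only for illustration) to show a case where `child::` and `descendant::` differ.

```r
library(XML)

# An outer list whose item contains an inner list; ids are illustrative.
s_nest <- '<div><ul id="outer"><li>a<ul id="inner"><li>b</li></ul></li></ul></div>'
h_nest <- xmlParse(s_nest, asText = TRUE)

# child:: reaches only the <ul> directly under <div>.
child_ids <- xpathSApply(h_nest, "//div/child::ul", xmlGetAttr, "id")

# descendant:: also reaches the <ul> nested inside the <li>.
desc_ids <- xpathSApply(h_nest, "//div/descendant::ul", xmlGetAttr, "id")

print(child_ids)
print(desc_ids)
```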
s <- '
<div>
<div>
<h1>Titulo</h1>
</div>
<div>
<h3>links</h3>
<a href="config/notas.html">Clique aqui</a>
<img src="images/graph.png"/>
</div>
</div>'
h <- xmlParse(s)
xpathApply(h, "//a[text()]")
xpathApply(h, "//a[text() = 'Clique aqui']")
xpathApply(h, "//a[starts-with(text(), 'Clique')]")
xpathApply(h, "//a[contains(text(), 'aqui')]")
xpathApply(h, "//img[contains(@src, '.png')]")
xpathApply(h, "//a[contains(@href, '.html')]")
xpathApply(h, "//div/*[starts-with(name(), 'h')]")
# libxml2 implements XPath 1.0, which does not provide ends-with();
# that function belongs to XPath 2.0, so this call is expected to fail.
try(xpathApply(h, "//img[ends-with(@src, '.png')]"))
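An ends-with test can be emulated with functions that XPath 1.0 does provide, by comparing the tail of the string obtained with `substring()` and `string-length()`. A minimal sketch, assuming a small hypothetical document with two images (names `s_img`, `h_img`, and the `photo.jpg` entry are illustrative):

```r
library(XML)

# Two images, only one of which ends in ".png".
s_img <- '<div><img src="images/graph.png"/><img src="images/photo.jpg"/></div>'
h_img <- xmlParse(s_img, asText = TRUE)

# Take the last 4 characters of @src and compare them with ".png".
path <- paste0(
    "//img[substring(@src, ",
    "string-length(@src) - string-length('.png') + 1) = '.png']"
)
png_src <- xpathSApply(h_img, path, xmlGetAttr, "src")
print(png_src)
```

Unlike `contains(@src, '.png')`, this form does not match a value such as `graph.png.bak`, where the suffix appears in the middle of the string.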
Figure 1: MUNZERT et al. (2015), page 87.
MUNZERT, S.; RUBBA, C.; MEIßNER, P.; NYHUIS, D. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley, 2015.