Compare commits
8 Commits
d9e2205553
...
04f8994933
Author | SHA1 | Date | |
---|---|---|---|
04f8994933 | |||
6a2670c94d | |||
7e05fb1432 | |||
3912a2a74c | |||
1aff1e7336 | |||
ecd50343a4 | |||
d79c9e9d47 | |||
6cf9edd955 |
@@ -4,7 +4,7 @@ Date: 2019-04-22 17:00
|
|||||||
Modified: 2023-05-14 20:00+02:00
|
Modified: 2023-05-14 20:00+02:00
|
||||||
Author: Fabrice
|
Author: Fabrice
|
||||||
Category: antisèches
|
Category: antisèches
|
||||||
Tags: git, termtosvg
|
Tags: git, termtosvg, cli
|
||||||
Slug: git-tricks
|
Slug: git-tricks
|
||||||
Header_Cover: ../images/covers/water.jpg
|
Header_Cover: ../images/covers/water.jpg
|
||||||
Summary: Une compilation de commandes git que j’utilise ponctuellement
|
Summary: Une compilation de commandes git que j’utilise ponctuellement
|
||||||
|
@@ -4,7 +4,7 @@ Date: 2019-04-22 17:00
|
|||||||
Modified: 2023-05-14 20:00+2:00
|
Modified: 2023-05-14 20:00+2:00
|
||||||
Author: Fabrice
|
Author: Fabrice
|
||||||
Category: cheat sheets
|
Category: cheat sheets
|
||||||
Tags: git, termtosvg
|
Tags: git, termtosvg, cli
|
||||||
Slug: git-tricks
|
Slug: git-tricks
|
||||||
Header_Cover: images/covers/water.jpg
|
Header_Cover: images/covers/water.jpg
|
||||||
Summary: A compilation of some `git` tricks I keep forgetting.
|
Summary: A compilation of some `git` tricks I keep forgetting.
|
||||||
|
@@ -3,7 +3,7 @@ Title: wget/curl
|
|||||||
Date: 2022-07-25 13:45 CEST
|
Date: 2022-07-25 13:45 CEST
|
||||||
Author: Fabrice
|
Author: Fabrice
|
||||||
Category: cheat sheets
|
Category: cheat sheets
|
||||||
Tags: wget, curl
|
Tags: wget, curl, cli
|
||||||
Slug: wget-curl
|
Slug: wget-curl
|
||||||
Header_Cover: ../images/covers/speedboat.jpg
|
Header_Cover: ../images/covers/speedboat.jpg
|
||||||
Summary: Quelques commandes wget et curl utiles dans la vie de tous les jours.
|
Summary: Quelques commandes wget et curl utiles dans la vie de tous les jours.
|
||||||
|
@@ -3,7 +3,7 @@ Title: wget/curl
|
|||||||
Date: 2022-07-25 13:45 CEST
|
Date: 2022-07-25 13:45 CEST
|
||||||
Author: Fabrice
|
Author: Fabrice
|
||||||
Category: cheat sheets
|
Category: cheat sheets
|
||||||
Tags: wget, curl
|
Tags: wget, curl, cli
|
||||||
Slug: wget-curl
|
Slug: wget-curl
|
||||||
Header_Cover: images/covers/speedboat.jpg
|
Header_Cover: images/covers/speedboat.jpg
|
||||||
Summary: Some useful wget and curl commands, such as downloading a repository.
|
Summary: Some useful wget and curl commands, such as downloading a repository.
|
||||||
|
BIN
content/images/covers/aix-en-provence.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 553 KiB |
BIN
content/images/covers/orgue.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 511 KiB |
@@ -1,7 +1,7 @@
|
|||||||
---
|
---
|
||||||
Title: Neovim as a LaTex Development Environment
|
Title: Neovim as a LaTex Development Environment
|
||||||
Date: 2023-10-14 12:00:00+0200
|
Date: 2023-10-14 12:00:00+0200
|
||||||
Date: 2023-10-14 17:00:00+0200
|
Modified: 2023-10-14 17:00:00+0200
|
||||||
Lang: en
|
Lang: en
|
||||||
Author: Fabrice
|
Author: Fabrice
|
||||||
Category: software
|
Category: software
|
||||||
|
@@ -5,7 +5,7 @@ Modified: 2019-04-24 11:12
|
|||||||
Lang: fr
|
Lang: fr
|
||||||
Author: Fabrice
|
Author: Fabrice
|
||||||
Category: programmes
|
Category: programmes
|
||||||
Tags: pass, git
|
Tags: pass, git, cli
|
||||||
Slug: password-store
|
Slug: password-store
|
||||||
Header_Cover: ../images/covers/clovers.jpg
|
Header_Cover: ../images/covers/clovers.jpg
|
||||||
Summary: Un gestionnaire de mots de passe simple qui repose sur gpg, et synchronisé via git.
|
Summary: Un gestionnaire de mots de passe simple qui repose sur gpg, et synchronisé via git.
|
||||||
|
@@ -4,7 +4,7 @@ Date: 2019-04-22 19:00
|
|||||||
Modified: 2019-04-23 14:24
|
Modified: 2019-04-23 14:24
|
||||||
Author: Fabrice
|
Author: Fabrice
|
||||||
Category: software
|
Category: software
|
||||||
Tags: pass, git
|
Tags: pass, git, cli
|
||||||
Slug: password-store
|
Slug: password-store
|
||||||
Header_Cover: images/covers/clovers.jpg
|
Header_Cover: images/covers/clovers.jpg
|
||||||
Summary: A simple password manager that relies on gpg, and synchronized with git.
|
Summary: A simple password manager that relies on gpg, and synchronized with git.
|
||||||
|
97
content/software/xidel.md
Normal file
@@ -0,0 +1,97 @@
|
|||||||
|
---
|
||||||
|
Title: Manipulate XML/HTML with Xidel
|
||||||
|
Date: 2023-10-29 22:00
|
||||||
|
Lang: en
|
||||||
|
Author: Fabrice
|
||||||
|
Category: software
|
||||||
|
Tags: xidel, html, xml, cli
|
||||||
|
Slug: xidel
|
||||||
|
table-of-contents: false
|
||||||
|
Header_Cover: ../images/covers/aix-en-provence.jpg
|
||||||
|
Summary: An example-based approach on how to easily parse XML/HTML files and stubs with Xidel.
|
||||||
|
---
|
||||||
|
|
||||||
|
You may know [jq](https://jqlang.github.io/jq/), which processes
|
||||||
|
[JSON](https://www.json.org/json-en.html) files on the command line. At some point I
|
||||||
|
was looking for an equally simple Swiss-army-knife tool for
|
||||||
|
[XML](https://www.w3.org/XML/)/[HTML](https://html.spec.whatwg.org/multipage/),
|
||||||
|
mostly for simple usages that don't require me to resort to a full-fledged
|
||||||
|
scripting language such as [Python](https://python.org) or dabbling in [regular
|
||||||
|
expressions](https://en.wikipedia.org/wiki/Regular_expression) that will never
|
||||||
|
work because of a carriage return at an unexpected place, and guess what? It exists!
|
||||||
|
|
||||||
|
This tool is [Xidel](https://www.videlibri.de/xidel.html). It is a bit more than
|
||||||
|
that as it also allows downloading files, which enables extra features such as
|
||||||
|
navigating a site following specific links. You can find more about it in the
|
||||||
|
[list of examples](https://www.videlibri.de/xidel.html#examples) given on the
|
||||||
|
project website, which is a nice introduction to the possibilities of the tool.
|
||||||
|
|
||||||
|
However, I mainly use it for simple cases, where I mix-and-match the best of
|
||||||
|
both worlds: a graphical client (such as
|
||||||
|
[firefox](https://www.mozilla.org/en-US/firefox/new/)), and a CLI tool, which in
|
||||||
|
this case is Xidel.
|
||||||
|
|
||||||
|
To illustrate this, we will look at a simple use case where filtering by hand can be a bit
|
||||||
|
tedious. Let us assume that we want to obtain the list of URLs of the PDF versions of
|
||||||
|
Victor Hugo's novels in French from Wikisource, where available.
|
||||||
|
|
||||||
|
We start from this page: <https://fr.wikisource.org/wiki/Auteur:Victor_Hugo>,
|
||||||
|
which lists what is available on <https://fr.wikisource.org>.
|
||||||
|
|
||||||
|
Now, we can simply select the “Romans” section as it is and copy it. Normally
|
||||||
|
you can check that the HTML is indeed in your clipboard by typing
|
||||||
|
`wl-paste -t text/html` on Wayland or `xclip -selection clipboard -o -t
|
||||||
|
text/html` on X11 if you have xclip installed. In the following we will assume a
|
||||||
|
Wayland environment with
|
||||||
|
[wl-clipboard](https://github.com/bugaevc/wl-clipboard), but it should also work
|
||||||
|
with `xclip` (not tested, please let me know how it behaves).
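To avoid typing a different command on each display server, the two clipboard commands above can be wrapped in a small function. This is only a sketch: `clipboard_html` is a name made up for this example, and it assumes that a non-empty `WAYLAND_DISPLAY` variable means a Wayland session.

```bash
# Print the clipboard content as HTML on standard output.
# `clipboard_html` is a hypothetical helper name, not part of any tool.
clipboard_html() {
  # Assumption: a non-empty WAYLAND_DISPLAY means we are on Wayland.
  if [ -n "${WAYLAND_DISPLAY:-}" ] && command -v wl-paste >/dev/null 2>&1; then
    wl-paste -t text/html
  elif command -v xclip >/dev/null 2>&1; then
    xclip -selection clipboard -o -t text/html
  else
    echo "no supported clipboard tool found (wl-paste or xclip)" >&2
    return 1
  fi
}
```

With this in place, `clipboard_html | xidel -e '//a/@href'` would read the same on both environments (the `xclip` branch is untested here as well).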
|
||||||
|
|
||||||
|
That's a good start, but we still need to filter and parse it. We can start with a
|
||||||
|
simple test:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
wl-paste -t text/html | xidel -e '//a/@href'
|
||||||
|
```
|
||||||
|
|
||||||
|
This shows us the target of each link in our selection. To explain the
|
||||||
|
syntax, the option `-e` tells `xidel` what to extract from the input. Its
|
||||||
|
argument is either a
|
||||||
|
[template](https://benibela.de/documentation/internettools/extendedhtmlparser.THtmlTemplateParser.html)
|
||||||
|
or an [XPath](https://en.wikipedia.org/wiki/XPath) expression that queries
|
||||||
|
the [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) tree. In the
|
||||||
|
above example we used the latter to obtain every anchor (`//a`) and then its
|
||||||
|
`href` attribute with `@href`.
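If you want to try the expression without going through the clipboard, a small inline document does the job. The sketch below makes a few assumptions: the HTML fragment and its URLs are invented for the example, `sample_html` is a made-up helper name, and `-` is passed explicitly so that `xidel` reads its input from standard input (`-s` only silences the status messages).

```bash
# Hypothetical stand-in for the clipboard content; the links are made up.
sample_html() {
  cat <<'EOF'
<ul>
  <li><a href="/wiki/Notre-Dame_de_Paris">Notre-Dame de Paris</a></li>
  <li><a href="/files/Notre-Dame_de_Paris.pdf">Notre-Dame de Paris (pdf)</a></li>
  <li><a href="/files/Les_Travailleurs_de_la_mer.epub">Les Travailleurs de la mer (epub)</a></li>
</ul>
EOF
}

if command -v xidel >/dev/null 2>&1; then
  # //a selects every anchor, /@href then selects its href attribute.
  sample_html | xidel -s - -e '//a/@href'
else
  echo "xidel is not installed" >&2
fi
```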
|
||||||
|
From there we can see that links to PDF versions contain the string… “pdf”.
|
||||||
|
Another nice part of XPath is that we can filter using
|
||||||
|
functions:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
wl-paste -t text/html | xidel -e '//a/@href[contains(., "pdf")]'
|
||||||
|
```
|
||||||
|
|
||||||
|
The last magical part here is the dot notation, which refers to the current
|
||||||
|
item “value”. I’m not the most familiar with the subtleties here, and you can
|
||||||
|
refer to this Stack Overflow [short answer](https://stackoverflow.com/a/38240971)
|
||||||
|
or the longer answer just above it for more details.
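One way to get a feel for the dot is to move the predicate around. In the sketch below (the HTML fragment and the `sample_html` helper are invented for the example), the first expression applies `.` to the `href` attribute, while the second applies it to the `<a>` element, that is, to its text content.

```bash
# Hypothetical sample: one link targets a pdf, the other only mentions it.
sample_html() {
  cat <<'EOF'
<p>
  <a href="/files/notre-dame.pdf">Notre-Dame de Paris</a>
  <a href="/wiki/Les_Travailleurs_de_la_mer">Les Travailleurs de la mer (pdf soon)</a>
</p>
EOF
}

if command -v xidel >/dev/null 2>&1; then
  # Here `.` is the href attribute: keep links whose target contains "pdf".
  sample_html | xidel -s - -e '//a/@href[contains(., "pdf")]'
  # Here `.` is the <a> element: keep links whose text contains "pdf".
  sample_html | xidel -s - -e '//a[contains(., "pdf")]/@href'
else
  echo "xidel is not installed" >&2
fi
```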
|
||||||
|
|
||||||
|
You can also change how the filtering is done. For instance, if the anchors you
|
||||||
|
are targeting are labeled “Download”, you can obtain the links with:
|
||||||
|
```bash
|
||||||
|
wl-paste -t text/html | xidel -e '//a[contains(., "Download")]/@href'
|
||||||
|
```
|
||||||
|
|
||||||
|
If you want strict equality, for instance because there are both “Download PDF” and “Download epub”
|
||||||
|
links:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
wl-paste -t text/html | xidel -e '//a[text()="Download PDF"]/@href'
|
||||||
|
```
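Strict equality on `text()` can be brittle when the label is surrounded by whitespace or line breaks in the source. A possible workaround, sketched below on an invented fragment (`sample_html` is a made-up helper), is the standard XPath function `normalize-space()`, which trims and collapses whitespace before the comparison.

```bash
# Hypothetical sample: the third label is wrapped in line breaks and spaces.
sample_html() {
  cat <<'EOF'
<div>
  <a href="/book.pdf">Download PDF</a>
  <a href="/book.epub">Download epub</a>
  <a href="/other.pdf">
    Download PDF
  </a>
</div>
EOF
}

if command -v xidel >/dev/null 2>&1; then
  # Exact comparison against the text node; surrounding whitespace
  # in the source can make it fail.
  sample_html | xidel -s - -e '//a[text()="Download PDF"]/@href'
  # normalize-space() strips leading and trailing whitespace and
  # collapses inner runs before the comparison.
  sample_html | xidel -s - -e '//a[normalize-space(.)="Download PDF"]/@href'
else
  echo "xidel is not installed" >&2
fi
```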
|
||||||
|
|
||||||
|
To go further, you can also pass HTTP headers and cookies to `xidel` via the
|
||||||
|
`--header/-H` and `--load-cookies` options respectively. It is also possible to
|
||||||
|
use the `--follow/-f` option to hop to the pages that match (using the same
|
||||||
|
syntax as above) to obtain a link from them… or even directly download them with
|
||||||
|
the `--download` option and so on.
|
||||||
|
|
||||||
|
In this blog post we only looked at a local copy of content pre-filtered with
|
||||||
|
your web browser, but the possibilities are endless!
|