diff --git a/content/software/xidel.md b/content/software/xidel.md
new file mode 100644
index 0000000..40ab202
--- /dev/null
+++ b/content/software/xidel.md
@@ -0,0 +1,97 @@
+---
+Title: Manipulate XML/HTML with Xidel
+Date: 2023-10-29 22:00
+Lang: en
+Author: Fabrice
+Category: software
+Tags: xidel, html, xml, cli
+Slug: xidel
+table-of-contents: false
+Header_Cover:
+Summary: Some information I would have loved to learn earlier in (Neo)Vim.
+---
+
+You may know [jq](https://jqlang.github.io/jq/), which processes
+[JSON](https://www.json.org/json-en.html) files on the command line. At some point I
+was looking for a similarly simple Swiss-army-knife tool for
+[XML](https://www.w3.org/XML/)/[HTML](https://html.spec.whatwg.org/multipage/),
+mostly for simple uses that don't require me to resort to a full-fledged
+scripting language such as [Python](https://python.org) or to dabble in [regular
+expressions](https://en.wikipedia.org/wiki/Regular_expression) that will never
+work because of a carriage return at an unexpected place. Guess what? It exists!
+
+This tool is [xidel](https://www.videlibri.de/xidel.html). It is a bit more than
+that, as it can also download files, which enables extra features such as
+navigating a site by following specific links. You can find out more in the
+[list of examples](https://www.videlibri.de/xidel.html#examples) on the
+project website, which is a nice introduction to what the tool can do.
+
+However, I mainly use it for simple cases, where I mix and match the best of
+both worlds: a graphical client (such as
+[Firefox](https://www.mozilla.org/en-US/firefox/new/)) and a CLI tool, which in
+this case is xidel.
+
+To illustrate this, let us look at a simple use case where filtering by hand can be a bit
+tedious. Let us assume that we want the list of URLs of the PDF versions of
+Victor Hugo's novels in French on Wikisource, when available.
+
+We start from this page: ,
+that lists which is available on .
+
+Now, we can simply select the “Romans” section as it is and copy it. Normally,
+you can check that you indeed have the HTML in your clipboard by typing
+`wl-paste -t text/html` on Wayland or `xclip -selection clipboard -o -t
+text/html` on X11 if you have xclip installed. In the following we will assume a
+Wayland environment with
+[wl-clipboard](https://github.com/bugaevc/wl-clipboard), but it should also work
+with `xclip` (not tested, please let me know how it behaves).
+
+That's a good start, but we still need to filter and parse it. We can start with a
+simple test:
+
+```bash
+wl-paste -t text/html | xidel -e '//a/@href'
+```
+
+This will show us the target of each link in our selection. To explain the
+syntax, the option `-e` tells `xidel` to extract content from its input using the
+expression that follows, which is either a
+[template](https://benibela.de/documentation/internettools/extendedhtmlparser.THtmlTemplateParser.html)
+or an [XPath](https://en.wikipedia.org/wiki/XPath) expression that queries
+the [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) tree. In the
+above example we used the latter, to obtain every anchor (`//a`) and then its
+`href` attribute with `@href`.
+From there we can see that the links to PDF versions contain the string… “pdf”.
+Now we can see another nice part of XPath: we can filter using
+functions:
+
+```bash
+wl-paste -t text/html | xidel -e '//a/@href[contains(., "pdf")]'
+```
+
+The last magical part here is the dot notation, which refers to the current
+item's “value”. I’m not the most familiar with the subtleties here, and you can
+refer to this Stack Overflow [short answer](https://stackoverflow.com/a/38240971)
+or the long answer just above it for more details.
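To make the dot notation a little more concrete, here is a minimal sketch that runs the same kind of filter on a small, made-up HTML fragment instead of the clipboard. The sample markup, its URLs and the variable names (`html`, `by_href`, `by_text`) are assumptions for illustration only. The `-` argument names standard input explicitly and `-s` silences status messages; neither is required by the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical sample markup standing in for the clipboard content.
html='<ul>
  <li><a href="/wiki/Book.pdf">Download PDF</a></li>
  <li><a href="/wiki/Book.epub">Download epub</a></li>
  <li><a href="/wiki/Page">pdf viewer help</a></li>
</ul>'

# Predicate on the attribute: inside @href[...], "." is the href value.
by_href='//a/@href[contains(., "pdf")]'
# Predicate on the anchor: inside a[...], "." is the text content of the <a>.
by_text='//a[contains(., "pdf")]/@href'

if command -v xidel >/dev/null 2>&1; then
  echo "Filtering on the href value:"
  printf '%s\n' "$html" | xidel -s - -e "$by_href"
  echo "Filtering on the anchor text:"
  printf '%s\n' "$html" | xidel -s - -e "$by_text"
else
  # Keep the sketch harmless on machines without xidel.
  echo "xidel not found; expressions to try: $by_href and $by_text" >&2
fi
```

Since `contains` is case-sensitive, the first expression should keep only `/wiki/Book.pdf` (the only `href` containing “pdf”), while the second should keep `/wiki/Page` (the only anchor whose text contains “pdf”). The position of the predicate decides what the dot stands for.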
+
+You can also change the way the filtering is done. For instance, if the anchors you
+are targeting are labelled “Download”, you can obtain the links with:
+```bash
+wl-paste -t text/html | xidel -e '//a[contains(., "Download")]/@href'
+```
+
+If you want strict equality, for instance because there are both “Download PDF” and “Download epub”
+links:
+
+```bash
+wl-paste -t text/html | xidel -e '//a[text()="Download PDF"]/@href'
+```
+
+To go further, you can also pass HTTP headers and cookies to `xidel` via the
+`--header/-H` and `--load-cookies` options respectively. It is also possible to
+use the `--follow/-f` option to hop to the pages that match (using the same
+syntax as above) to obtain a link from them… or even directly download it with
+the `--download` option, and so on.
+
+In this blog post we only looked at a local copy of content pre-filtered with
+your web browser, but the possibilities are endless!