Title: Manipulate XML/HTML with Xidel
Date: 2023-10-29 22:00
Lang: en
Author: Fabrice
Category: software
Tags: xidel, html, xml, cli
Slug: xidel
table-of-contents: false
Header_Cover: ../images/covers/aix-en-provence.jpg
Summary: An example-based approach on how to easily parse XML/HTML files and stubs with Xidel.

You may know jq, which lets you process JSON files on the command line. At some point I was looking for the simplicity of such a Swiss-Army-knife tool for XML/HTML, mostly for simple usages that don't require me to resort to a full-fledged scripting language such as Python, or to dabble in regular expressions that will never work because of a carriage return in an unexpected place. And guess what? It exists!

This tool is Xidel. It is actually a bit more than that, as it also allows downloading files, which enables extra features such as navigating a site by following specific links. You can find out more in the list of examples given on the project website, which is a nice introduction to the possibilities of the tool.

However, I mainly use it for simple cases, where I mix and match the best of both worlds: a graphical client (such as Firefox) and a CLI tool, which in this case is Xidel.

To see this in action, we will go through a simple use case where filtering by hand can be a bit tedious. Let us assume that we want to obtain the list of URLs of the PDF versions of Victor Hugo's novels in French from Wikisource, when available.

We start from this page: https://fr.wikisource.org/wiki/Auteur:Victor_Hugo, which lists Victor Hugo's works available on https://fr.wikisource.org.

Now, we can simply select the “Romans” section as it is and copy it. You can check that you indeed have the HTML in your clipboard by typing `wl-paste -t text/html` on Wayland, or `xclip -selection clipboard -o -t text/html` on X11 if you have xclip installed. In the following we will assume a Wayland environment with wl-clipboard, but it should also work with xclip (not tested, please let me know how it behaves).
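As a quick sanity check, you can pipe the clipboard content through `head` to see the first lines of the copied HTML (these are the same commands as above, nothing new here):

```bash
# Wayland: dump the HTML selection currently in the clipboard
wl-paste -t text/html | head

# X11 equivalent (requires xclip)
xclip -selection clipboard -o -t text/html | head
```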

That's good, but we now need to filter and parse it. We can start with a simple test:

```bash
wl-paste -t text/html | xidel -e '//a/@href'
```

This will show us the target of each link in our selection. To explain the syntax: the `-e` option tells xidel to extract content from its input, using either a template or an XPath expression to query the DOM tree. In the above example we used the latter, to obtain every anchor (`//a`) and then its `href` attribute with `@href`. From there we can see that the PDF versions contain the string… “pdf”. And now comes another nice part of XPath: we can filter using functions:

```bash
wl-paste -t text/html | xidel -e '//a/@href[contains(., "pdf")]'
```

The last magical part here is the dot notation, which refers to the “value” of the current item. I'm not the most familiar with the subtleties here; you can refer to this short Stack Overflow answer, or the long answer just above it, for more details.
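Note that, as far as I understand, the same filtering can equivalently be done on the anchors themselves, by testing their `href` attribute before selecting it; both forms should return the same links here, since an element carries at most one `href`:

```bash
# Equivalent formulation: keep the <a> elements whose href contains "pdf",
# then select that attribute
wl-paste -t text/html | xidel -e '//a[contains(@href, "pdf")]/@href'
```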

You can also change what the filtering applies to. For instance, if the anchors you are targeting are named “Download”, you can obtain the links with:

```bash
wl-paste -t text/html | xidel -e '//a[contains(., "Download")]/@href'
```

If you want strict equality, because there are both “Download PDF” and “Download epub” links for instance:

```bash
wl-paste -t text/html | xidel -e '//a[text()="Download PDF"]/@href'
```

To go further, you can also pass HTTP headers and cookies to xidel via the `--header`/`-H` and `--load-cookies` options respectively. It is also possible to use the `--follow`/`-f` option to hop into the pages that match (using the same syntax as above) and obtain a link from them… or even directly download them with the `--download` option, and so on.
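As a rough sketch of what this could look like for our use case (untested, and assuming the default behaviors of `--follow` and `--download`; check `xidel --help` before relying on it), one could try to fetch the author page directly and download the linked PDFs in one go:

```bash
# Untested sketch: follow every link whose href contains "pdf" on the
# author page, and save each target in the current directory
xidel 'https://fr.wikisource.org/wiki/Auteur:Victor_Hugo' \
  -f '//a[contains(@href, "pdf")]/@href' \
  --download .
```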

In this blog post we only looked at a local version of pre-filtered content obtained through your web browser, but the possibilities are endless!