---
Title: wget/curl
Date: 2022-07-25 13:45 CEST
Author: Fabrice
Category: cheat sheets
Tags: wget, curl
Slug: wget-curl
Header_Cover: images/covers/speedboat.jpg
Summary: Some useful wget and curl commands, such as downloading a repository.
Lang: en
---
# wget or curl?
`wget` is a tool to download contents from the command line.
In its basic form, it allows downloading a file quite easily just by typing `wget <url>` in your favorite terminal.
However, a quick look at the [man](https://www.gnu.org/software/wget/manual/wget.html) page shows how powerful this tool is.
Similarly, `curl` is another tool for handling internet requests. A look at its [man](https://curl.haxx.se/docs/manpage.html) page shows that it supports more protocols than `wget`, which only handles http(s) and ftp requests.
On the other hand, `wget` can follow links (recursively), apply filters on your requests, transform relative links,…
Thus, they don't cover the same area of usage (even if the intersection is non-empty).
In short, `wget` proves useful whenever you have to download part of a website while following links, whereas `curl` is very handy for tweaking single requests in an atomic fashion.
Moreover, if you want to analyze web requests, Firefox and Chromium (I didn't try other browsers) allow exporting requests directly as a `curl` command from the web inspector, which makes the job less painful than with [netcat](https://en.wikipedia.org/wiki/Netcat).
To conclude, I'm definitely not a `wget`/`curl` power user and I don't use those tools on a daily basis, so there may be very basic stuff here.
Anyway, as I said, this section is to help me remember these commands to [reduce my google requests](https://degooglisons-internet.org/en/).
# wget
## Download a full repository
Download a repository, selecting only specific files:
```sh
wget --recursive --no-parent --no-host-directories --cut-dirs=<n> --accept <extension list> <url>
```
Where `<n>` denotes the number of subdirectories to omit from saving. For instance, to download the cover images from this blog at the address “<https://blog.epheme.re/images/covers/>”, you can put:
```sh
wget -r -np -nH --cut-dirs=2 -A jpg https://blog.epheme.re/images/covers/
```
Anyhow, a simpler method, if you don't need the directory structure (as in the example above), is to use the `--no-directories`/`-nd` option. However, `--cut-dirs` can be useful if you need to keep part of the structure (e.g., if the files are sorted into directories by date or category).
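To make the effect of `--cut-dirs` more concrete, here is a small sketch that mimics how the saved path changes when the host directory is dropped and the first `<n>` directories are cut. It is plain string manipulation for illustration, not what `wget` runs internally, and `cut_dirs` is a name made up for this example:

```sh
# Show the local path for a remote path once the host directory is
# dropped (-nH) and the first n directories are cut (--cut-dirs=n).
# Illustration only: wget is never invoked here.
cut_dirs() {
    n=$1
    path=${2#/}                       # drop a leading slash, if any
    while [ "$n" -gt 0 ]; do
        case $path in
            */*) path=${path#*/} ;;   # remove the leading directory
            *)   break ;;             # only the file name is left
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$path"
}

cut_dirs 0 images/covers/speedboat.jpg   # images/covers/speedboat.jpg
cut_dirs 1 images/covers/speedboat.jpg   # covers/speedboat.jpg
cut_dirs 2 images/covers/speedboat.jpg   # speedboat.jpg
```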
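To reject some documents, you can use the `-R`/`--reject` option, which, like `-A`, takes a comma-separated list of suffixes or shell-style patterns. To filter with regular expressions instead, use `--reject-regex` (or `--accept-regex`), whose flavor can be specified using `--regex-type`.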
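As a rough sketch of how these filters combine, the helper below flattens the download with `-nd`, skips generated index pages with a pattern, and skips any URL containing `/drafts/` with a regular expression. The function name, the patterns and the URL in the usage line are all made up for the example:

```sh
# Hypothetical helper: recursive download without directory structure,
# rejecting index pages (shell-style pattern) and draft URLs (POSIX regex).
fetch_filtered() {
    wget --recursive --no-parent --no-directories \
         --reject 'index.html*' \
         --regex-type posix --reject-regex '/drafts/' \
         "$1"
}
# Usage (hypothetical URL): fetch_filtered https://example.org/files/
```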
## Mirror a website
Another handy use of `wget` is to make a local copy of a website. To do this, the long version is:
```sh
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
```
The option names are quite self-explanatory, and the short version is: `wget -mkEp -np -nH <url>`
### Ignoring robots.txt
Sometimes, [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) forbids access to some resources. You can easily bypass this with the option `-e robots=off`.
### Number of tries
Occasionally, when the server is too busy to answer you, `wget` will try again and again (20 times by default), which can slow down your mirroring quite a bit (especially if the timeout is long). You can lower this bound using the… `--tries`/`-t` option.
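Putting the mirroring options together with the two tweaks above could look like the sketch below. The values for `--tries`, `--timeout` and `--wait` are arbitrary choices for illustration, and `mirror_site` is a hypothetical helper name:

```sh
# Hypothetical helper: mirror a site, ignore robots.txt, and give up
# sooner on slow servers. The numeric values are arbitrary examples.
mirror_site() {
    wget --mirror --convert-links --adjust-extension --page-requisites \
         --no-parent --no-host-directories \
         -e robots=off \
         --tries=3 --timeout=10 --wait=1 \
         "$1"
}
# Usage (hypothetical URL): mirror_site https://example.org/docs/
```

The `--wait=1` pause between requests is there to stay polite with the server, which matters even more once robots.txt is ignored.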
## Finding 404 on a website
With the `--spider` option, `wget` does not actually download files, so you can use it as a debugger for your website, with `--output-file`/`-o` to log the result to a file.
```sh
wget --spider -r -nd -o <logfile> <url>
```
The list of broken links is then summarized at the end of the log file.
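To pull only the broken URLs out of the log, a small filter is enough. This sketch assumes the English-locale summary that `wget` prints at the end of a spider run (a `Found N broken links.` line, then one URL per line, then a `FINISHED` line); if your log looks different, the patterns need adjusting. `broken_links` is a name made up for this example:

```sh
# Print the URLs listed in the broken-links summary of a `wget --spider` log.
# Assumed log layout: "Found N broken link(s).", URLs, then "FINISHED ...".
broken_links() {
    awk '
        /^Found [0-9]+ broken links?\./ { in_list = 1; next }  # summary starts
        /^FINISHED/                     { in_list = 0 }        # summary ends
        in_list && NF                   { print }              # skip blank lines
    ' "$1"
}
# Usage: broken_links <logfile>
```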
# curl
## Send a POST request
My most frequent use of `curl` is to send POST requests to different kinds of APIs. The syntax is quite simple using the `--form`/`-F` option:
```sh
curl -F <field1>=<content1> -F <field2>=<content2> <url>
```
Note that to send a file, precede the filename with an `@`:
```sh
curl -F picture=@face.jpg <url>
```
<!-- vim: spl=en
-->