---
Title: wget/curl
Date: 2022-07-25 13:45 CEST
Author: Fabrice
Category: cheat sheets
Tags: wget, curl
Slug: wget-curl
Header_Cover: images/covers/speedboat.jpg
Summary: Some useful wget and curl commands, such as downloading a repository.
Lang: en
---
# wget or curl?
`wget` is a tool to download contents from the command line.
In its basic form, it allows downloading a file quite easily just by typing `wget <url>` in your favorite terminal.
However, a quick look at the [man](https://www.gnu.org/software/wget/manual/wget.html) page shows how powerful this tool is.
Similarly, `curl` is another tool to handle internet requests. However, a look at its [man](https://curl.haxx.se/docs/manpage.html) page shows that it supports more protocols than `wget`, which only handles HTTP(S) and FTP requests.
On the other hand, `wget` can follow links (recursively), apply filters to your requests, convert relative links for local browsing, and more.
Thus, they don't cover the same area of usage (even if the intersection is non-empty).
In short, `wget` proves useful whenever you have to download part of a website while following links, while `curl` is very handy for tweaking single requests in an atomic fashion.
Moreover, if you want to analyze web traffic, Firefox and Chromium (I didn't try other browsers) allow exporting requests directly as a `curl` command from the web inspector, which makes the job less painful than with [netcat](https://en.wikipedia.org/wiki/Netcat).
To conclude, I'm definitely not a `wget`/`curl` power user, so there may be very basic stuff here, since I don't use these tools on a daily basis.
Anyway, as I said, this section is to help me remember these commands to [reduce my google requests](https://degooglisons-internet.org/en/).
# wget
## Download a full repository
To download a repository while selecting only specific files:
```sh
wget --recursive --no-parent --no-host-directories --cut-dirs=<n> --accept <extension list> <url>
```
Where `<n>` denotes the number of subdirectories to omit from saving. For instance, to download the cover images from this blog at the address “<https://blog.epheme.re/images/covers/>”, you can run:
```sh
wget -r -np -nH --cut-dirs=2 -A jpg https://blog.epheme.re/images/covers/
```
Anyhow, a simpler method, if you don't need the directory structure (as in the above example), is to use the `--no-directories`/`-nd` option. However, `--cut-dirs` can be useful if you need some of the structure (e.g., if the files are sorted in directories by date or category).
To reject some documents, you can also use the `--reject`/`-R` option, which accepts a list of suffixes and patterns. Rejections can also be expressed as regular expressions with `--reject-regex` (whose flavor can be specified using `--regex-type`).
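For instance, the following sketch (the URL is a placeholder) skips all PNG and GIF files while recursing:

```sh
# recurse without ascending to the parent directory, rejecting
# any URL ending in .png or .gif (POSIX regex flavor by default)
wget -r -np --reject-regex '\.(png|gif)$' <url>
```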
## Mirror a website
Another common use of `wget` is making a local copy of a website. The long version is:
```sh
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
```
The option names are quite self-explanatory, and the short version is: `wget -mkEp -nH -np <url>`
### Ignoring robots.txt
Sometimes, [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) forbids access to some resources. You can easily bypass this with the option `-e robots=off`.
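Combined with mirroring, this looks like (placeholder URL):

```sh
# mirror the site while ignoring its robots exclusion rules
wget -m -e robots=off <url>
```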
### Number of tries
Occasionally, when the server is slow to answer, `wget` will retry again and again (20 times by default), which can slow down your mirroring quite a bit (especially if the timeout is large). You can lower this bound using the `--tries`/`-t` option.
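As a sketch (the values are arbitrary and the URL is a placeholder), lowering both the retry count and the timeout helps when mirroring a flaky server:

```sh
# give up on each file after 3 tries, waiting at most 10 s per attempt
wget -m -t 3 --timeout=10 <url>
```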
## Finding 404 on a website
With the `--spider` option, `wget` checks links without actually downloading the files, so you can use it as a link checker for your website, together with `--output-file`/`-o` to log the results to a file.
```sh
wget --spider -r -nd -o <logfile> <url>
```
The list of broken links is then summarized at the end of the log file.
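Since the summary sits at the end of the log, a quick way to pull it out is to grep for it (the exact wording of the report varies between `wget` versions, so treat the pattern as an assumption):

```sh
# print the broken-link report from the spider log
grep -i -A 20 'broken link' <logfile>
```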
# curl
## Send a POST request
My most frequent use of `curl` is sending POST requests to different kinds of APIs. The syntax is quite simple using the `--form`/`-F` option:
```sh
curl -F <field1>=<content1> -F <field2>=<content2> <url>
```
Note that to send a file, precede the filename with an `@`:
```sh
curl -F picture=@face.jpg <url>
```
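Note that `-F` sends a `multipart/form-data` body. For APIs that expect JSON instead, `--data`/`-d` with an explicit `Content-Type` header does the job (the URL and payload are placeholders):

```sh
# -d alone would send application/x-www-form-urlencoded,
# so set the Content-Type header explicitly for JSON
curl -H 'Content-Type: application/json' -d '{"field": "value"}' <url>
```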
<!-- vim: spl=en
-->