---
Title: wget/curl
Date: 2022-07-25 13:45 CEST
Author: Fabrice
Category: cheat sheets
Tags: wget, curl, cli
Slug: wget-curl
Header_Cover: images/covers/speedboat.jpg
Summary: Some useful wget and curl commands, such as downloading a repository.
Lang: en
---

# wget or curl?

`wget` is a tool to download content from the command line.
In its basic form, it lets you download a file just by typing `wget <url>` in your favorite terminal.

However, a quick look at the [man](https://www.gnu.org/software/wget/manual/wget.html) page immediately shows how powerful this tool is.

Similarly, `curl` is another tool to handle internet requests. A look at its [man](https://curl.haxx.se/docs/manpage.html) page shows that it supports more protocols than `wget`, which only handles http(s) and ftp requests.

On the other hand, `wget` can follow links (recursively), apply filters on your requests, transform relative links,…
Thus, they don't cover the same area of usage (even if the intersection is non-empty).
In short, `wget` will prove useful whenever you have to download part of a website while exploring links, while `curl` can be very handy to tweak single requests in an atomic fashion.
Moreover, if you want to analyze web traffic, firefox and chromium (I didn't try other browsers) allow exporting requests directly as a `curl` command from the web inspector, which makes the job less painful than with [netcat](https://en.wikipedia.org/wiki/Netcat).

To conclude, I'm definitely not a `wget`/`curl` power user, so there may be very basic stuff here, as I'm not using those tools on a daily basis.
Anyway, as I said, this section is here to help me remember these commands and [reduce my google requests](https://degooglisons-internet.org/en/).

# wget

## Download a full repository

Download a repository, selecting specific files:
```sh
wget --recursive --no-parent --no-host-directories --cut-dirs=<n> --accept <extension list> <url>
```

Here, `<n>` denotes the number of leading directories to strip from the saved paths. For instance, to download the cover images from this blog at the address “<https://blog.epheme.re/images/covers/>”, you can use:
```sh
wget -rnpnH --cut-dirs=2 -A jpg https://blog.epheme.re/
```

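To get a feeling for what `-nH` and `--cut-dirs` do to the saved paths, the trimming can be emulated on a plain string. This is only a rough sketch of the behaviour, not what `wget` runs internally; the path is this post's cover image, and `remote_path`, `cut_dirs` and `local_path` are illustrative variable names.

```sh
# Path of a file on the server, relative to the host
# (what is left once -nH has dropped the host name).
remote_path="images/covers/speedboat.jpg"
# Number of leading directories to strip, as given to --cut-dirs.
cut_dirs=2

# Keep every "/"-separated field after the first $cut_dirs ones.
local_path=$(printf '%s\n' "$remote_path" | cut -d/ -f"$((cut_dirs + 1))-")

echo "$local_path"   # speedboat.jpg (with cut_dirs=1: covers/speedboat.jpg)
```
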
Anyhow, a simpler method, if you don't need the directory structure (as in the example above), is to use the `--no-directories`/`-nd` option. However, `--cut-dirs` can be useful if you need some structural information (e.g., if the files are sorted in directories by date or category).
To reject some documents, you can also use the `--reject`/`-R` option, which takes a comma-separated list of suffixes or wildcard patterns. To filter with regular expressions instead, use `--reject-regex` (the regex flavor can be chosen with `--regex-type`).

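
As a sketch of how these filters can be combined, the hypothetical helper below downloads jpg files while skipping thumbnails and anything under a `drafts/` directory. The function name, the patterns and the URL in the usage line are made up for illustration.

```sh
# Hypothetical helper: fetch jpg files below a URL, with two kinds of rejection.
fetch_covers() {
  # --reject takes suffixes or wildcard patterns for file names;
  # --reject-regex is matched against the complete URL.
  wget --recursive --no-parent --no-directories \
       --accept jpg \
       --reject '*thumb*' \
       --reject-regex '/drafts/' \
       "$1"
}

# Usage (example.org is a placeholder):
# fetch_covers https://example.org/images/
```
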
## Mirror a website

Another handy use of `wget` is simply to make a local copy of a website. To do this, the long version is:
```sh
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
```

The option names are quite self-explanatory, and the short version is: `wget -mkEp -nH -np <url>`

### Ignoring robots.txt

Sometimes, [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) forbids access to some resources. You can easily bypass this with the option `-e robots=off`.

### Number of tries

Occasionally, when the server is too busy to answer you, `wget` will try again and again (20 times by default), which can slow down your mirroring quite a bit (especially if the timeout is long). You can lower this bound using the… `--tries`/`-t` option.

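
Putting the options of this section together, a mirroring call could look like the sketch below. The helper name and the chosen values are only illustrative; `--timeout` and `--wait` are two other `wget` options, added here as an assumption about what a gentle mirror would want.

```sh
# Hypothetical helper: mirror a site, ignoring robots.txt and giving up quickly.
mirror_site() {
  # -e robots=off : do not honour robots.txt
  # --tries=3     : at most 3 attempts per file instead of the default 20
  # --timeout=10  : wait at most 10 seconds on a stalled connection
  # --wait=1      : pause 1 second between retrievals
  wget --mirror --no-host-directories --convert-links --adjust-extension \
       --page-requisites --no-parent \
       -e robots=off --tries=3 --timeout=10 --wait=1 \
       "$1"
}

# Usage (example.org is a placeholder):
# mirror_site https://example.org/
```
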
## Finding 404s on a website

With the `--spider` option, `wget` does not actually download files, so you can use it as a debugger for your website, together with `--output-file`/`-o` to log the result in a file.

```sh
wget --spider -r -nd -o <logfile> <url>
```

The list of broken links is then summarized at the end of the log file.

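
To pull that list out of the log, something like the following could work. It is a sketch that assumes the summary starts with a line such as `Found 2 broken links.` followed by one URL per line, which is how `wget` prints it with English messages as far as I know; the sample log is made up so that the snippet can run on its own.

```sh
# Made-up excerpt of a spider log, only here to make the example runnable.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
Found 2 broken links.

https://example.org/old-page.html
https://example.org/images/missing.jpg

FINISHED --2022-07-25 13:45:00--
EOF

# Print the URLs listed after the "Found N broken link(s)." line.
broken=$(awk '/^Found [0-9]+ broken link/ { found = 1; next }
              found && /^https?:\/\// { print }' "$logfile")

echo "$broken"
rm -f "$logfile"
```

With a real log, replace the sample file with the one given to `-o`.
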
# curl

## Send a POST request

My most frequent use of `curl` is to send POST requests to different kinds of APIs. The syntax is quite simple using the `--form`/`-F` option:

```sh
curl -F <field1>=<content1> -F <field2>=<content2> <url>
```

Note that to send a file, precede the filename with an `@`:

```sh
curl -F picture=@face.jpg <url>
```

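As a slightly fuller sketch, a text field and a file can be sent in the same request, and the file part can carry an explicit content type with `;type=`. The helper name, the field names and the URL in the usage line are hypothetical.

```sh
# Hypothetical helper: post a caption and a picture as multipart/form-data.
upload_picture() {
  # $1: picture file, $2: caption text, $3: target URL
  curl -F "caption=$2" \
       -F "picture=@$1;type=image/jpeg" \
       "$3"
}

# Usage (placeholder URL):
# upload_picture face.jpg "Me at the lake" https://example.org/api/upload
```
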
<!-- vim: spl=en
-->