Small fixes in the wget article

Fabrice Mouhartem 2022-07-26 10:24:43 +02:00
parent 37cf2ad7b7
commit a701bb221a
2 changed files with 10 additions and 10 deletions

View File

@@ -13,7 +13,7 @@ Lang: fr
`wget` is a tool for downloading content non-interactively from sites over FTP/HTTP(S), etc.
In its most basic use, it lets you download content by simply typing `wget ${url}` in your favourite terminal emulator.
However, browsing its [documentation](https://www.gnu.org/software/wget/manual/wget.html) reveals how powerful it actually is.
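For instance, a minimal invocation (the file URL below is just an illustrative placeholder, not from the article):

```sh
# download a single file into the current directory
wget https://example.com/archive.tar.gz
```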
@@ -76,7 +76,7 @@ web. To do this, the long command is:
```sh
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
```
The option names are fairly self-explanatory, and the short version would be: `wget -mkEp -np <url>`
### Ignore `robots.txt`

View File

@@ -17,14 +17,14 @@ In its basic form, it allows downloading a file quite easily just by typing `wge
However, a simple look at the [man](https://www.gnu.org/software/wget/manual/wget.html) page directly shows how powerful this tool is.
Similarly, `curl` is another tool for handling internet requests; however, a look at its [man](https://curl.haxx.se/docs/manpage.html) page shows that it supports more protocols than `wget`, which only handles HTTP(S) and FTP requests.
On the other hand, `wget` can follow links (recursively), apply filters to your requests, transform relative links,…
Thus, they don't cover the same area of usage (even if the intersection is non-empty).
To put it shortly, `wget` will prove useful whenever you have to download part of a website while following links, while `curl` can be very handy to tweak single requests in an atomic fashion.
Moreover, if you want to analyze web requests, Firefox and Chromium (I didn't try other browsers) allow exporting requests directly as a `curl` command from the web inspector, which makes the job less painful than with [netcat](https://en.wikipedia.org/wiki/Netcat).
To conclude, I'm definitely not a `wget`/`curl` power user, so there may be very basic stuff here, as I don't use those tools on a daily basis.
Anyway, as I said, this section is here to help me remember these commands and [reduce my google requests](https://degooglisons-internet.org/en/).

# wget
@@ -41,12 +41,12 @@ Where `<n>` denotes the number of subdirectories to omit from saving. For instan
```sh
wget -rnpnH --cut-dirs=2 -A jpg https://blog.epheme.re/
```
Anyhow, a simpler method, if you don't need the directory structure (as in the above example), is to use the `--no-directories`/`-nd` option. However, `--cut-dirs` can be useful if you need to keep some of the directory structure (e.g., if the files are sorted in directories by date or category).
To reject some documents, you can also use the `-R` option, which also accepts regular expressions (whose type can be specified using `--regex-type`).
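For instance, a sketch of the flat variant of the crawl above; the reject pattern is purely illustrative, and the second command assumes a `wget` built with PCRE support:

```sh
# same crawl as above, but dump every jpg flat into the current directory
wget -r -np -nH -nd -A jpg https://blog.epheme.re/

# hypothetical: skip any URL matching a PCRE pattern
wget -r -np --reject-regex '.*/drafts/.*' --regex-type pcre https://blog.epheme.re/
```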
## Mirror a website
Another useful application of `wget` is simply making a local copy of a website. To do this, the long version is:
```sh
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
```
@@ -59,7 +59,7 @@ Sometimes, [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
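A minimal sketch of the usual way to make `wget` ignore `robots.txt`, assuming the standard `-e robots=off` switch (the wgetrc `robots` setting); the article's exact command may differ:

```sh
# turn off robots.txt handling for this run only
wget --mirror -e robots=off <url>
```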
### Number of tries
Occasionally, when the server is busy answering you, `wget` will try again and again (20 times by default), which can slow down your mirroring quite a bit (especially if the timeout is long). You can lower this bound using the… `--tries`/`-t` option.
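For example, a quick sketch (the value 3 is arbitrary):

```sh
# give up on an unresponsive URL after 3 attempts instead of the default 20
wget --mirror --tries=3 <url>
```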
## Finding 404 on a website
@@ -75,7 +75,7 @@ The list of broken links is then summarized at the end of the log file.
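A sketch of one way to do this (not necessarily the exact command from the article): crawl the site in spider mode while keeping a log, then read the summary at the end of it:

```sh
# crawl the whole site without saving anything, writing a full log
wget --spider -r -o spider.log https://blog.epheme.re/
# the broken links are summarized at the end of the log file
tail -n 20 spider.log
```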
## Send a POST request
My most frequent use of `curl` is to send POST requests to different kinds of APIs; the syntax is quite simple using the `--form`/`-F` option:
```sh
curl -F <field1>=<content1> -F <field2>=<content2> <url>
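
# A hypothetical concrete call (the endpoint and field names are illustrative):
# the @ prefix makes curl send the file's contents rather than the literal string
curl -F "title=holidays" -F "file=@photo.jpg" https://example.com/api/upload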