Small fixes in the wget article
commit a701bb221a (parent 37cf2ad7b7)
@@ -13,7 +13,7 @@ Lang: fr
 `wget` est un outil qui permet le téléchargement de manière non interactive de
 contenu sur des sites via FTP/HTTP(s), etc.
 Dans son utilisation la plus basique, il permet de télécharger du contenu en
-tapant simplement `wget ${url}` dans son émulateur de terminal favoris.
+tapant simplement `wget ${url}` dans son émulateur de terminal favori.
 
 Cependant, en parcourant sa [documentation](https://www.gnu.org/software/wget/manual/wget.html) permet de se rendre compte de sa puissance.
 
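The hunk above describes `wget`'s most basic invocation. As a minimal illustration (example.com stands in for any URL):

```sh
# Fetch a single page into the current directory
wget https://example.com/index.html
```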
@@ -76,7 +76,7 @@ web. Pour ce faire, la commande longe est :
 wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
 ```
 
-Le nom des potions est assez claire, et la version courte serait : `wget -mkEp -np <url>`
+Le nom des options est assez claire, et la version courte serait : `wget -mkEp -np <url>`
 
 ### Ignorer `robots.txt`
 
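For reference, the short form quoted in this hunk expands flag by flag to the long one: `-m` = `--mirror`, `-k` = `--convert-links`, `-E` = `--adjust-extension`, `-p` = `--page-requisites`, `-np` = `--no-parent`; `--no-host-directories` has the short form `-nH`. A sketch with a placeholder URL:

```sh
# These two mirror commands are equivalent (URL is a placeholder)
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent https://example.com/
wget -mkEp -np -nH https://example.com/
```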
@@ -17,14 +17,14 @@ In its basic form, it allows downloading a file quite easily just by typing `wge
 
 However, a simple look to the [man](https://www.gnu.org/software/wget/manual/wget.html) page directly shows how powerful this tool is.
 
-Similarily, `curl` is another tool to handle internet requests, however, a look at the [man](https://curl.haxx.se/docs/manpage.html) page shows that it supports a lot more protocols than wget which only handles https(s) and ftp requests.
+Similarily, `curl` is another tool to handle internet requests, however, a look at the [man](https://curl.haxx.se/docs/manpage.html) page shows that it supports more protocols than `wget` which only handles https(s) and ftp requests.
 
-On the other hand, wget can follow links (recursively), apply filters on your requests, transform relative links,…
+On the other hand, `wget` can follow links (recursively), apply filters on your requests, transform relative links,…
 Thus, they don't cover the same area of usage (even if the intersection is non-empty).
-To put it short wget will prove useful whenever you have to download a part of a website while exploring links, while curl can be very handy to tweak single requests in an atomic fashion.
+To put it short `wget` will prove useful whenever you have to download a part of a website while exploring links, while curl can be very handy to tweak single requests in an atomic fashion.
 Moreover, if you want to analyze web information, firefox and chromium (I didn't try on other browsers) allows exporting requests directly as a curl command from the web inspector, which makes the job less painful than with [netcat](https://en.wikipedia.org/wiki/Netcat).
 
-To conclude, I'm definitely not a wget/curl poweruser, so there may be very basic stuff here, but as I'm not using those tools on a daily basis.
+To conclude, I'm definitely not a `wget`/`curl` poweruser, so there may be very basic stuff here, but as I'm not using those tools on a daily basis.
 Anyway, as I said, this section is to help me remember these commands to [reduce my google requests](https://degooglisons-internet.org/en/).
 
 # wget
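As a rough illustration of that division of labour (both commands are generic sketches, not taken from the article):

```sh
# curl: full control over a single request, e.g. inspect response headers only
curl -I https://example.com/
# wget: crawl outward from a page, following links one level deep
wget -r -l 1 https://example.com/
```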
@@ -41,12 +41,12 @@ Where `<n>` denotes the number of subdirectories to omit from saving. For instan
 wget -rnpnH --cut-dirs=2 -A jpg https://blog.epheme.re/
 ```
 
-Anyhow, a simpler method, if you don't need the directory structure (for instance in the above example), is to use the `--no-directories/-nd` option. However, the cut-dirs can be useful if you need some architecture information (e.g., if the files are sorted in directories by date or categories)
+Anyhow, a simpler method, if you don't need the directory structure (for instance in the above example), is to use the `--no-directories`/`-nd` option. However, the cut-dirs can be useful if you need some architecture information (e.g., if the files are sorted in directories by date or categories)
 To reject some documents, you can also use the option `-R`, which also accepts regular expressions (which type can be specified using --regex-type)
 
 ## Mirror a website
 
-Another useful use of wget is just to make a local copy of a website. To do this, the long version is:
+Another useful use of `wget` is just to make a local copy of a website. To do this, the long version is:
 ```sh
 wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
 ```
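A minimal sketch of the filtering options this hunk mentions, reusing the blog URL already present in the hunk (the patterns themselves are illustrative):

```sh
# Flatten everything into the current directory and keep only jpg files
wget -r -np -nd -A jpg https://blog.epheme.re/
# Same crawl, but skip PDFs instead (-R takes comma-separated patterns)
wget -r -np -nd -R '*.pdf' https://blog.epheme.re/
```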
@@ -59,7 +59,7 @@ Sometimes, [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
 
 ### Number of tries
 
-Occasionally, when the server is busy answering you, wget will try again and again (20 times by default), which can slower your mirroring quite a bit (especially if the timeout is big). You can lower this bound using the… `--tries/-t` option.
+Occasionally, when the server is busy answering you, `wget` will try again and again (20 times by default), which can slower your mirroring quite a bit (especially if the timeout is big). You can lower this bound using the… `--tries/-t` option.
 
 ## Finding 404 on a website
 
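A minimal sketch of capping retries (the numbers are arbitrary):

```sh
# Give up after 3 attempts instead of the default 20, with a 10-second timeout
wget --tries=3 --timeout=10 --mirror https://example.com/
```

The next hunk refers to a log file listing broken links; the article's exact invocation is not part of this diff, but a common recipe for such a crawl looks like:

```sh
# Crawl the site without saving anything and log the results;
# broken links are summarized at the end of wget.log
wget --spider -r -nv -o wget.log https://example.com/
```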
@@ -75,7 +75,7 @@ The list of broken links is then summarized at the end of the log file.
 
 ## Send a POST request
 
-My most frequent use of curl is to send POST requests to different kind of API, the syntax is quite simple using the `-F` option:
+My most frequent use of `curl` is to send POST requests to different kind of API, the syntax is quite simple using the `--form`/`-F` option:
 
 ```sh
 curl -F <field1>=<content1> -F <field2>=<content2> <url>
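As a concrete instance of that template (the endpoint and field names are hypothetical):

```sh
# Submit a form field and upload a file; '@' makes curl read the value from disk
curl -F 'name=backup' -F 'file=@backup.tar.gz' https://example.com/api/upload
```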