---
Title: wget/curl
Date: 2022-07-25 13:45 CEST
Author: Fabrice
Category: cheat sheets
Tags: wget, curl, cli
Slug: wget-curl
Header_Cover: images/covers/speedboat.jpg
Summary: Some useful wget and curl commands, such as downloading a repository.
Lang: en
---

# wget or curl?

`wget` is a tool to download content from the command line.
In its basic form, it downloads a file just by typing `wget <url>` in your favorite terminal.

However, a quick look at the [man](https://www.gnu.org/software/wget/manual/wget.html) page shows how powerful this tool is.

Similarly, `curl` is another tool to handle internet requests. However, a look at its [man](https://curl.haxx.se/docs/manpage.html) page shows that it supports more protocols than `wget`, which only handles http(s) and ftp requests.

On the other hand, `wget` can follow links (recursively), apply filters on your requests, transform relative links,…
Thus, they don't cover the same area of usage (even if the intersection is non-empty).
In short, `wget` will prove useful whenever you have to download part of a website while exploring links, while `curl` can be very handy to tweak single requests in an atomic fashion.
Moreover, if you want to analyze web information, Firefox and Chromium (I didn't try other browsers) allow exporting requests directly as a `curl` command from the web inspector, which makes the job less painful than with [netcat](https://en.wikipedia.org/wiki/Netcat).
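
To illustrate where the two tools overlap, here is a minimal sketch of fetching a single file with either one. The function names and the URL are placeholders made up for this note. The options themselves are standard: unlike `wget`, `curl` writes to standard output by default, so `--remote-name`/`-O` is needed to save under the remote file name, and `--location`/`-L` to follow redirects.

```sh
# Hypothetical helpers (illustrative names): save <url> under its remote file name.
fetch_with_wget() {
    wget "$1"
}

fetch_with_curl() {
    # --location follows redirects, --remote-name keeps the remote file name
    curl --location --remote-name "$1"
}

# Usage, with a placeholder URL:
# fetch_with_wget https://example.com/file.tar.gz
# fetch_with_curl https://example.com/file.tar.gz
```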

To conclude, I'm definitely not a `wget`/`curl` power user, so there may be very basic stuff here, but as I'm not using those tools on a daily basis, I tend to forget even the basics.
Anyway, as I said, this section is to help me remember these commands to [reduce my Google requests](https://degooglisons-internet.org/en/).

# wget

## Download a full repository

To download a repository while selecting specific files:
```sh
wget --recursive --no-parent --no-host-directories --cut-dirs=<n> --accept <extension list> <url>
```

Where `<n>` denotes the number of leading directories to omit when saving. For instance, to download the cover images from this blog, which live under <https://blog.epheme.re/images/covers/>, you can run:
```sh
wget -r -np -nH --cut-dirs=2 -A jpg https://blog.epheme.re/
```

Anyhow, a simpler method, if you don't need the directory structure (as in the example above), is to use the `--no-directories`/`-nd` option. However, `--cut-dirs` can be useful if you need to keep part of the structure (e.g., if the files are sorted in directories by date or category).
To reject some documents, you can also use the `--reject`/`-R` option, which takes a comma-separated list of suffixes or wildcard patterns. For regular expressions, use `--reject-regex` (or `--accept-regex`), whose flavor can be chosen with `--regex-type`.
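
As a sketch of how accepting and rejecting can be combined, the helper below keeps only PDF files and skips any URL containing `/drafts/`. The function name, the pattern and the URL are invented for illustration; `--reject-regex` is matched against the complete URL.

```sh
# Hypothetical helper: fetch only PDF files below <url>, skipping any URL
# that contains "/drafts/" (the pattern is a made-up example).
fetch_pdfs() {
    wget --recursive --no-parent --no-directories \
         --accept pdf \
         --reject-regex '/drafts/' \
         "$1"
}

# Usage, with a placeholder URL:
# fetch_pdfs https://example.com/papers/
```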

## Mirror a website

Another handy use of `wget` is to make a local copy of a website. To do this, the long version is:
```sh
wget --mirror --no-host-directories --convert-links --adjust-extension --page-requisites --no-parent <url>
```

The option names are quite self-explanatory, and the short version is: `wget -mkEp -nH -np <url>`

### Ignoring robots.txt

Sometimes, [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) forbids access to some resources. You can easily bypass this with the option `-e robots=off`.
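
Put together with the mirroring options, this could look like the sketch below. The function name and the one-second delay are arbitrary choices for this example; `--wait` and `--random-wait` are added only to stay gentle on a server whose rules are being ignored.

```sh
# Hypothetical helper: mirror <url> while ignoring robots.txt, pausing
# between requests (the 1 s base delay is an arbitrary choice).
mirror_ignoring_robots() {
    wget --mirror --no-parent -e robots=off --wait=1 --random-wait "$1"
}

# Usage, with a placeholder URL:
# mirror_ignoring_robots https://example.com/docs/
```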

### Number of tries

Occasionally, when the server is too busy to answer, `wget` will try again and again (20 times by default), which can slow down your mirroring quite a bit (especially if the timeout is long). You can lower this bound using the `--tries`/`-t` option.
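
A minimal sketch of a mirror that gives up quickly, assuming three tries and a ten-second timeout are acceptable (both values and the function name are arbitrary). `--timeout`/`-T` sets the network timeout in seconds, and `--waitretry` caps the pause between retries of the same file.

```sh
# Hypothetical helper: mirror <url>, but give up on a file after 3 tries,
# with a 10 s network timeout and at most 2 s between retries.
mirror_few_retries() {
    wget --mirror --no-parent --tries=3 --timeout=10 --waitretry=2 "$1"
}

# Usage, with a placeholder URL:
# mirror_few_retries https://example.com/docs/
```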

## Finding 404s on a website

With the `--spider` option, `wget` does not actually download files, so you can use it as a debugger for your website, with `--output-file`/`-o` to log the result in a file.

```sh
wget --spider -r -nd -o <logfile> <url>
```

The list of broken links is then summarized at the end of the log file.
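
To pull only that summary out of a long log, something like the following sketch can help. It assumes the log was written in an English locale, where the summary starts with a line such as `Found 2 broken links.` followed by one URL per line; the function name is made up, and the pattern may need adjusting for other locales or `wget` versions.

```sh
# Hypothetical helper: print the broken URLs listed at the end of a
# `wget --spider` log file.
# Assumption: English locale, summary introduced by "Found N broken link(s)."
broken_links() {
    # Keep everything from the summary line to the end of the file,
    # then keep only the lines that are URLs.
    sed -n '/^Found [0-9][0-9]* broken link/,$p' "$1" | grep -E '^https?://'
}

# Usage:
# broken_links <logfile>
```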

# curl

## Send a POST request

My most frequent use of `curl` is to send POST requests to different kinds of APIs. The syntax is quite simple using the `--form`/`-F` option:

```sh
curl -F <field1>=<content1> -F <field2>=<content2> <url>
```

Note that to send a file, precede the filename with an `@`:

```sh
curl -F picture=@face.jpg <url>
```
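
The two forms can be mixed in a single request. The sketch below sends a text field together with a file and sets the file's MIME type explicitly with `;type=`. The function name, the field names and the endpoint are placeholders rather than a real API; `--fail` makes `curl` exit with an error on HTTP failure responses instead of silently printing the error page.

```sh
# Hypothetical helper: upload <file> with a <caption> to <url>.
# The field names "caption" and "picture" are invented for this example.
upload_picture() {
    curl --fail \
         --form "caption=$2" \
         --form "picture=@$1;type=image/jpeg" \
         "$3"
}

# Usage, with a placeholder endpoint:
# upload_picture face.jpg "My face" https://example.com/api/upload
```

Keep in mind that `-F` sends the body as `multipart/form-data`; for a URL-encoded form body, `--data`/`-d` is the option to use instead.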

<!-- vim: spl=en
-->