wget
GNU Wget (or just wget) is a computer program that retrieves content from web servers, and is part of the GNU Project. Its name is derived from World Wide Web and get, connotative of its primary function. It supports downloading via HTTP, HTTPS, and FTP protocols, the most popular TCP/IP-based protocols used for web browsing.[1]
Its features include recursive download, conversion of links for offline viewing of local HTML, support for proxies, and much more. It appeared in 1996, coinciding with the boom in the Web's popularity, and quickly came into wide use among Unix users and was included with most major Linux distributions. Written in portable C, wget can be easily installed on any Unix-like system and has been ported to many environments.[1]
- GNU Project Homepage [EN]
- wget [EN] @ Fedora Package
- wget [EN] @ Homebrew Formula
Documentation
- GNU wget Manual [EN]
- man 1 'wget' [EN]
Syntax
wget [PARAMETER ...] [URL ...]
Parameters
- -A LIST, --accept=LIST
-R LIST, --reject=LIST - Specify comma-separated LISTs of file name suffixes or patterns to accept or reject.
- NOTE: If any of the wildcard characters *, ?, [ or ] appears in an element of LIST, it will be treated as a pattern, rather than a suffix.
- -c, --continue
- Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of wget, or by another program.
- -E, --adjust-extension
- If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end in the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
- -F, --force-html
- When input is read from a file, force it to be treated as an HTML file. This enables you to retrieve relative links from existing HTML files on your local disk, by adding <base href="URL"> to HTML, or using the --base command-line option.
- -i FILE, --input-file=FILE
- Read URLs from a local or external FILE. If - is specified as FILE, URLs are read from the standard input.
- If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command line will be the first ones to be retrieved. If --force-html is not specified, then FILE should consist of a series of URLs, one per line.
- -k, --convert-links
- After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
- -l DEPTH, --level=DEPTH
- Specify recursion maximum DEPTH level. Specify inf for infinite levels. The default maximum depth is 5.
- --load-cookies=FILE
- Load cookies from FILE before the first HTTP retrieval. FILE is a textual file in the format originally used by Netscape's cookies.txt file.
- -m, --mirror
- Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.
- -nH, --no-host-directories
- Disable generation of host-prefixed directories.
- --no-cache
- Disable server-side caching. In this case, wget will send the remote server an appropriate directive (Pragma: no-cache) to get the file from the remote server itself, rather than having a cached version returned.
- --no-check-certificate
- Don't check the server certificate against the available certificate authorities.
- -nv, --no-verbose
- Turn off verbose output without being completely quiet, which means that error messages and basic information still get printed.
- -np, --no-parent
- Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
- -O FILE, --output-document=FILE
- The documents will not be written to the appropriate files, but all will be concatenated together and written to FILE. If - is used as FILE, documents will be printed to standard output, disabling link conversion.
- -p, --page-requisites
- This option causes wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
- -P PREFIX, --directory-prefix=PREFIX
- Set directory prefix to PREFIX. The directory PREFIX is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
- -r, --recursive
- Turn on recursive retrieving.
- --spider
- When invoked with this option, wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
- --user=USER
--password=PASSWORD - Specify the username USER and password PASSWORD for both FTP and HTTP file retrieval.
- -X LIST, --exclude-directories=LIST
- Specify a comma-separated LIST of directories you wish to exclude from download. Elements of LIST may contain wildcards.
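Several of the options above are typically combined. As an illustrative sketch (the URL is a placeholder, not taken from an actual site), the following restricts a recursive crawl to PDF files, never ascends above the start directory, and skips the host-prefixed top directory:

```shell
# Recursively fetch only *.pdf files below the start URL.
# --recursive: enable recursion; --no-parent: never ascend to the parent;
# --no-host-directories: do not create a host-named top directory;
# --accept: comma-separated accept-list of file name suffixes.
wget --recursive --no-parent --no-host-directories --accept='pdf' \
  'https://www.example.org/papers/'
```

HTML pages that wget must still fetch to discover links are parsed and then deleted, since they do not match the accept list.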
Examples
- Archive an entire web site
wget --recursive --level='inf' --no-parent --page-requisites --adjust-extension \
--convert-links 'https://www.raysoft.ch/'
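Resuming an interrupted download is a common companion to archiving; a sketch with a placeholder URL:

```shell
# Continue a partially-downloaded file instead of starting over.
# The URL is a placeholder; the server must support HTTP range
# requests for an actual resume to take place.
wget --continue 'https://www.example.org/downloads/image.iso'
```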
- Check bookmarks
wget --spider --force-html --input-file="${HOME}/tmp/bookmarks.html"
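Writing the document to standard output with --output-document=- makes wget easy to combine with other tools in a pipeline; a sketch with a placeholder URL:

```shell
# Print the retrieved document to standard output (this disables
# link conversion) and pipe it into another program.
# The URL is a placeholder.
wget --quiet --output-document=- 'https://www.example.org/robots.txt' | head
```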