wget

GNU Wget (or just wget) is a computer program that retrieves content from web servers, and is part of the GNU Project. Its name is derived from World Wide Web and get, connotative of its primary function. It supports downloading via HTTP, HTTPS, and FTP protocols, the most popular TCP/IP-based protocols used for web browsing.[1]

Its features include recursive download, conversion of links for offline viewing of local HTML, support for proxies, and much more. It appeared in 1996, coinciding with the boom in popularity of the Web, which led to its wide use among Unix users and its distribution with most major Linux distributions. Written in portable C, wget can be easily installed on any Unix-like system and has been ported to many environments.[1]

Documentation

Syntax

wget [PARAMETER ...] [URL ...]
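
For example, the simplest invocation downloads a single URL into the current directory (the URL below is only an illustrative placeholder):
wget 'https://example.com/archive.tar.gz'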

Parameters

-A LIST, --accept=LIST
-R LIST, --reject=LIST
Specify comma-separated LISTs of file name suffixes or patterns to accept or reject.
NOTE:
If any of the wildcard characters, *, ?, [ or ], appear in an element of LIST, it will be treated as a pattern, rather than a suffix.
-c, --continue
Continue getting a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of wget, or by another program.
-E, --adjust-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
-F, --force-html
When input is read from a file, force it to be treated as an HTML file. This enables you to retrieve relative links from existing HTML files on your local disk, by adding <base href="URL"> to HTML, or using the --base command-line option.
-i FILE, --input-file=FILE
Read URLs from a local or external FILE. If - is specified as file, URLs are read from the standard input.
If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command line will be retrieved first. If --force-html is not specified, then FILE should consist of a series of URLs, one per line.
-k, --convert-links
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
-l DEPTH, --level=DEPTH
Specify recursion maximum DEPTH level. Specify inf for infinite levels. The default maximum depth is 5.
--load-cookies=FILE
Load cookies from FILE before the first HTTP retrieval. FILE is a textual file in the format originally used by Netscape's cookies.txt file.
-m, --mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.
-nH, --no-host-directories
Disable generation of host-prefixed directories.
--no-cache
Disable server-side cache. In this case, wget will send the remote server an appropriate directive (Pragma: no-cache) to get the file from the remote service, rather than returning the cached version.
--no-check-certificate
Don't check the server certificate against the available certificate authorities.
-nv, --no-verbose
Turn off verbose output without being completely quiet, which means that error messages and basic information still get printed.
-np, --no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
-O FILE, --output-document=FILE
The documents will not be written to the appropriate files, but all will be concatenated together and written to FILE. If - is used as FILE, documents will be printed to standard output, disabling link conversion.
-p, --page-requisites
This option causes wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
-P PREFIX, --directory-prefix=PREFIX
Set directory prefix to PREFIX. The directory PREFIX is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
-r, --recursive
Turn on recursive retrieving.
--spider
When invoked with this option, wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
--user=USER
--password=PASSWORD
Specify the username USER and password PASSWORD for both FTP and HTTP file retrieval.
-X LIST, --exclude-directories=LIST
Specify a comma-separated LIST of directories you wish to exclude from download. Elements of LIST may contain wildcards.

Examples

Archive an entire web site
wget --recursive --level='inf' --no-parent --page-requisites --adjust-extension \
  --convert-links 'https://www.raysoft.ch/'
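Resume an interrupted download (the URL here is only an illustrative placeholder)
wget --continue 'https://example.com/downloads/large-image.iso'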
Check bookmarks
wget --spider --force-html --input-file="${HOME}/tmp/bookmarks.html"
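A few further sketches of the parameters described above; the hosts, paths, credentials and patterns below are placeholders and are not taken from the original page.
Download every URL listed in a plain-text file, one URL per line
wget --input-file="${HOME}/tmp/urls.txt"
Fetch only JPEG and PNG images from one directory tree, without ascending to the parent
wget --recursive --no-parent --accept='jpg,jpeg,png' 'https://example.com/gallery/'
Mirror an FTP area with authentication, skipping selected directories
wget --mirror --no-host-directories --user='anonymous' --password='guest' \
  --exclude-directories='/pub/tmp,/pub/incoming' 'ftp://ftp.example.com/pub/'
Save a single page with everything needed to display it offline into a separate directory
wget --page-requisites --convert-links --adjust-extension \
  --directory-prefix="${HOME}/tmp/offline" 'https://example.com/article.html'
Print a document to standard output instead of saving it
wget --no-verbose --output-document=- 'https://example.com/robots.txt'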

References