1. wget
wget -t 10 --limit-rate 50k -Q 10M -c http://www.linuxeye.com -O linuxeye.html -o download.log
    -t            number of retries
    --limit-rate  download speed limit
    -Q            maximum download quota
    -c            resume an interrupted download
    -O            output file name
    -o            log file
wget -r -N -l 2 http://www.linuxeye.com
    -r            recursive download
    -N            use timestamping on files
    -l            how many levels of pages to descend
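Accessing HTTP or FTP pages that require authentication
wget --user username --password pass URL
You can also leave the password off the command line and have wget prompt for it interactively; to do so, replace --password with --ask-password.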
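The options above can be combined in a small wrapper. The following is only a sketch: the function name fetch_throttled, its argument order, and the DRY_RUN switch are hypothetical and not part of wget. It uses only the -t, --limit-rate, -c and -o options described in this section.

```bash
# Hypothetical helper: download one URL with retries, a rate limit and resume.
# Usage: fetch_throttled URL [RATE] [LOGFILE]
# Set DRY_RUN=1 to print the wget command instead of running it.
fetch_throttled() {
    local url=$1
    local rate=${2:-50k}
    local log=${3:-download.log}

    # Build the argument list in the positional parameters.
    set -- wget -t 10 --limit-rate "$rate" -c -o "$log" "$url"

    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"      # show what would be executed
    else
        "$@"           # run wget with the arguments built above
    fi
}
```

Running it once as DRY_RUN=1 fetch_throttled http://www.linuxeye.com is a cheap way to check the command line before starting a long download.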
2. curl
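curl http://www.linuxeye.com --silent -o linuxeye.html
    --silent        suppress progress information; remove it if you want to see progress
    -o              write the downloaded data to a file instead of standard output
    --progress-bar  show progress as a bar of # characters
    -C              resume an interrupted download (-C - lets curl work out the offset itself)
    --referer       set the referer string
    --cookie        set cookies
    --user-agent    set the user agent string
    --limit-rate    limit bandwidth
    --max-filesize  maximum size of a file to download
    -u              authentication (curl -u user:pass http://www.linuxeye.com)
    -I              print only the response headers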
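Since -I prints only the response headers, its output is easy to post-process. Below is a minimal sketch of pulling one header out of that output; header_value is a hypothetical helper name, and it assumes the usual "Name: value" header lines terminated by CRLF.

```bash
# Hypothetical helper: read response headers on stdin (e.g. from `curl -sI URL`)
# and print the value of the first header whose name matches $1, ignoring case.
header_value() {
    awk -v name="$1" '
        BEGIN { FS = ": *"; name = tolower(name) }
        { sub(/\r$/, "") }                  # headers end in CRLF; drop the CR
        tolower($1) == name {
            sub(/^[^:]*: */, "")            # strip "Name: " and keep the value
            print
            exit
        }'
}
```

For example, curl -sI http://www.linuxeye.com | header_value content-type prints the Content-Type of the page, if the server sends one.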
3. Accessing Gmail from the command line
curl -u username@gmail.com:password --silent "https://mail.google.com/mail/feed/atom" | tr -d '\n' | sed 's:</entry>:\n:g' | sed 's/.*<title>\(.*\)<\/title.*<author><name>\([^<]*\)<\/name><email>\([^<]*\).*/Author: \2 [\3] \n Subject: \1\n/'
Sample output:
Author: Facebook [update+kjdm15577-jd@facebookmail.com]
 Subject: Facebook的有趣专页
Author: offers [offers@godaddy.com]
 Subject: Reminder: Get 25% OFF your order – no minimum!
Author: Google+ team [noreply-475ba29f@plus.google.com]
 Subject: Top 3 posts for you on Google+ this week
Author: Facebook [update+kjdm15577-jd@facebookmail.com]
 Subject: Facebook的有趣专页
A shorter variant using perl:
curl -u username@gmail.com:password --silent "https://mail.google.com/mail/feed/atom" | perl -ne 'print "\t" if /<name>/; print "$2\n" if /<(title|name)>(.*)<\/\1>/;'
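To see what the tr and sed stages do without touching a real mailbox, the same kind of pipeline can be run on a small hand-written feed. This is a sketch under stated assumptions: parse_atom_entries is a hypothetical name, the sample feed and addresses are made up, GNU sed is assumed (for \n in the replacement), and sed -n with the p flag is used so that lines without an entry are dropped.

```bash
# Hypothetical helper: read an Atom feed on stdin and print author and subject
# of each entry. Assumes GNU sed, which accepts \n in the replacement text.
parse_atom_entries() {
    tr -d '\n' |                 # join the feed into a single line
    sed 's:</entry>:\n:g' |      # then put each entry on a line of its own
    sed -n 's/.*<title>\(.*\)<\/title.*<author><name>\([^<]*\)<\/name><email>\([^<]*\).*/Author: \2 [\3]\nSubject: \1/p'
}

# Made-up sample data standing in for the real feed.
parse_atom_entries <<'EOF'
<feed>
<title>Inbox</title>
<entry>
<title>Hello</title>
<author><name>Alice</name><email>alice@example.com</email></author>
</entry>
<entry>
<title>Lunch?</title>
<author><name>Bob</name><email>bob@example.com</email></author>
</entry>
</feed>
EOF
# Prints:
# Author: Alice [alice@example.com]
# Subject: Hello
# Author: Bob [bob@example.com]
# Subject: Lunch?
```

Because the leading .* is greedy, the feed-level title is skipped and only the title inside each entry is captured.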
4. A bash script to scrape and download images from a web page
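#!/bin/bash
# FileName: img_downloader.sh
# Download every image referenced by an <img src=...> tag on a page.

if [ $# -ne 3 ]; then
    echo "Usage: $0 URL -d DIRECTORY"
    exit 1
fi

# Accept "-d DIRECTORY" either before or after the URL.
while [ $# -gt 0 ]; do
    case $1 in
        -d) shift; directory=$1; shift ;;
        *)  url=${url:-$1}; shift ;;
    esac
done

mkdir -p "$directory"
baseurl=$(echo "$url" | grep -Eo "https?://[a-z.]+")
echo "$baseurl"

# Collect the src value of each <img> tag, whether single- or double-quoted.
curl -s "$url" | grep -Eo "<img src=[^>]*>" | awk -F"\"|'" '{print $2}' > /tmp/$$.list
# Turn root-relative paths into absolute URLs.
sed -i "s|^/|$baseurl/|" /tmp/$$.list

cd "$directory" || exit 1
while read -r filename; do
    echo "$filename"
    curl -s -O "$filename"
done < /tmp/$$.list
rm -f /tmp/$$.list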
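Note: the script in the original book uses sed to extract the image path, which only works when the src value is double-quoted. Many pages quote image paths with single quotes, so I used awk instead; the sed expression in the original script could also be adapted to handle both.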
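The extraction step can be tried offline on a snippet of HTML. The following is a sketch, not the book's method: extract_img_urls is a hypothetical name, and unlike the script above it also accepts <img> tags where src is not the first attribute. It assumes the base URL contains no | character and that src values are quoted.

```bash
# Hypothetical helper: read HTML on stdin, print one image URL per line.
# $1 is the base URL (scheme and host) used for root-relative paths.
extract_img_urls() {
    local baseurl=$1
    # Keep each <img ...> tag up to the end of its src value.
    grep -oE "<img[^>]*[[:space:]]src=[\"'][^\"']*" |
    # Drop everything up to and including the opening quote.
    sed -E "s/.*src=[\"']//" |
    # Prefix root-relative paths with the base URL.
    sed "s|^/|$baseurl/|"
}
```

For example, curl -s "$url" | extract_img_urls "$baseurl" > /tmp/$$.list could stand in for the grep, awk and sed lines of the script.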
5. A bash script that uses curl to find broken links on a website
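#!/bin/bash
# Report links on a site that do not answer with HTTP 200.

if [ $# -ne 1 ]; then
    echo "Usage: $0 URL"
    exit 1
fi

echo "Broken links:"
mkdir -p /tmp/$$.lynx
cd /tmp/$$.lynx || exit 1

# Crawl from the given URL; the link list is then taken from reject.dat.
lynx -traversal "$1" > /dev/null

count=0
sort -u reject.dat > links.txt
while read -r link; do
    # Match the status line with or without a reason phrase (HTTP/2 omits "OK").
    output=$(curl -I -s "$link" | grep -E "^HTTP/[0-9.]+ 200")
    if [[ -z $output ]]; then
        echo "$link"
        count=$((count + 1))
    fi
done < links.txt

[ "$count" -eq 0 ] && echo "No broken links found."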
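Grepping the status line is one way to test a link; another is to ask curl for the numeric status code with -w '%{http_code}'. The sketch below takes that route. The function names are hypothetical, and treating every 2xx and 3xx response as healthy is an assumption of this example, whereas the script above accepts only 200.

```bash
# Print the numeric HTTP status for a URL (000 if the request fails).
# -I asks for headers only, -o /dev/null discards them, -w prints the code.
link_status() {
    curl -s -o /dev/null -I -w '%{http_code}' "$1"
}

# Succeed (exit status 0) when a status code should count as broken.
# Assumption of this sketch: 2xx and 3xx are fine, everything else is broken.
is_broken() {
    case $1 in
        2??|3??) return 1 ;;
        *)       return 0 ;;
    esac
}

# Read one URL per line on stdin and print the broken ones.
report_broken() {
    while read -r link; do
        if is_broken "$(link_status "$link")"; then
            echo "$link"
        fi
    done
}
```

With the links.txt file produced by the script, report_broken < links.txt lists the broken links.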