
A Clumsy Way to Download All the PDF Files on a Web Page with Python

Date: 2019-11-16 20:19:58

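Go to the download center: /en/download-center/

Right-click anywhere on the page and choose Inspect.

In the panel that opens on the right, select the Elements tab.

Scroll down until you find "catgroup downloads".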

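Alternatively, skip the previous two steps: right-click the first download link on the page and choose Inspect.

The inspector then jumps straight to that link, and catgroup downloads is visible just above it.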

At this point you can see all the catgroup downloads blocks. Together they contain every application document listed under /en/download-center/.

Right-click the first <div class="catgroup downloads"> and choose Edit as HTML.

The full content of that <div class="catgroup downloads"> then appears.

Copy all of it into the file DownloadCenter_catgroupDownloads.txt.
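The copied fragment is plain HTML, so the links in it can also be pulled out with Python's standard html.parser module rather than by splitting strings. The following is only a minimal sketch, assuming the download links are ordinary <a href="..."> tags. The names LinkCollector and extract_links, the sample fragment and the example.com URLs are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that point to PDF or STEP files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Compare case-insensitively so .PDF and .STEP are matched too.
        if href and href.lower().endswith((".pdf", ".step")):
            self.links.append(href)


def extract_links(html_text):
    """Return all PDF/STEP links found in an HTML fragment."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


# Hypothetical fragment, invented for this example only.
sample = '''
<div class="catgroup downloads">
  <a href="https://example.com/files/Datasheet_A.pdf">Datasheet A</a>
  <a href="https://example.com/files/Model_B.STEP">Model B</a>
  <a href="https://example.com/en/contact/">Contact</a>
</div>
'''
print(extract_links(sample))
```

To use it on the real data, read DownloadCenter_catgroupDownloads.txt into a string and pass that string to extract_links.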

Use a Python script to download every PDF listed in this txt file:

# -*- coding: utf-8 -*-
import os
import re
import urllib.request

folder = os.path.dirname(os.path.realpath(__file__))
# Save next to this script instead of a hard-coded absolute path.
out_dir = os.path.join(folder, "DownloadCenter")
info_name = os.path.join(folder, "DownloadCenter_catgroupDownloads.txt")


def downpdf(pdflist):
    """Download every URL in pdflist into out_dir."""
    for count, pdfurl in enumerate(pdflist, start=1):
        print(pdfurl)
        # Keep the original file name and extension (.pdf or .step).
        filename = os.path.join(out_dir, os.path.basename(pdfurl))
        # urlopen needs an absolute URL, as in the original script.
        with open(filename, "wb") as f:
            f.write(urllib.request.urlopen(pdfurl).read())
        print("downloaded %s file(s) >>>>" % count)
    print("download finished")


os.makedirs(out_dir, exist_ok=True)

pdflist = []
with open(info_name, "r") as f:
    for line in f:
        # Only lines that link to a PDF or STEP file are of interest.
        if ".pdf" in line.lower() or ".step" in line.lower():
            match = re.search(r'href="([^"]+)"', line)
            if match:
                pdflist.append(match.group(1))

downpdf(pdflist)
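The script hands each href straight to urlopen, which only works when the href is an absolute URL. If the copied HTML contains relative links, they have to be joined with the site address first. Here is a small sketch of that step together with deriving a local file name from the URL path. BASE_URL is a placeholder (the real site address is not given here), and absolute_url and local_name are hypothetical helper names.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Placeholder only: replace with the real address of the download center.
BASE_URL = "https://example.com/en/download-center/"


def absolute_url(href, base=BASE_URL):
    """Turn a relative href into an absolute URL; absolute hrefs pass through."""
    return urljoin(base, href)


def local_name(url):
    """Return the file name part of the URL path, without any query string."""
    return posixpath.basename(urlparse(url).path)


for href in ["/media/Datasheet_A.pdf", "https://example.com/files/Model_B.STEP?v=2"]:
    url = absolute_url(href)
    print(url, "->", local_name(url))
```

Using the name taken from the URL path keeps the original extension, so a STEP file is not saved with a .pdf ending.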

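Clear the contents of DownloadCenter_catgroupDownloads.txt, then repeat the last three steps (Edit as HTML, copy into the txt file, run the script) for the second catgroup downloads block. Once this has been done for every catgroup downloads block, all the PDF files are on the local computer.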
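A possible variation on the clear-and-repeat loop is to save each copied block into its own text file and collect the links from all of them in one pass. The sketch below assumes a made-up naming scheme (DownloadCenter_catgroupDownloads_1.txt, _2.txt and so on) and hypothetical helper names. It only builds the list of links, skipping duplicates, and leaves the downloading to the script above.

```python
import glob
import re

# Matches href="..." values ending in .pdf or .step, in any letter case.
HREF_RE = re.compile(r'href="([^"]+\.(?:pdf|step))"', re.IGNORECASE)


def unique_links(fragments):
    """Return PDF/STEP links from all fragments, in first-seen order, no duplicates."""
    seen = set()
    links = []
    for text in fragments:
        for url in HREF_RE.findall(text):
            if url not in seen:
                seen.add(url)
                links.append(url)
    return links


def read_fragments(pattern):
    """Read every file matching the glob pattern and return the contents as a list."""
    fragments = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            fragments.append(f.read())
    return fragments


# Assumed naming scheme: one file per copied catgroup downloads block.
links = unique_links(read_fragments("DownloadCenter_catgroupDownloads_*.txt"))
print(len(links), "links found")
```

The resulting list can be passed to downpdf in place of the list built from a single file.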
