Building a small tool: downloading and merging preview-only online documents

May 10, 2020 Research

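Some material on the internet is offered for online preview only and cannot be downloaded directly. A poorly designed preview system, or slow loading on a limited connection, can seriously hurt the reading experience. Using research reports of the National Natural Science Foundation of China (NSFC) and graduate theses as two examples, this post describes two different ways to download the page images and merge them into a PDF, for teachers and students who may need them. Comments and suggestions are welcome.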

1. NSFC research reports

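Project reports published on the NSFC shared service portal (国家自然科学基金共享服务网) can only be read online. Taking the project 《颗粒物质基本性质的研究》 (Research on the Basic Properties of Granular Matter) as an example, open the full-text page of its final report and get the URL of the first page image: http://output.nsfc.gov.cn/report/10/10274071_1.png
Once the URL of the target page is known, the rest of the work can be automated with Python. The code is as follows: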

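import os
import natsort
from fpdf import FPDF

# Download the page images with wget. Page numbering starts at 1
# (the first page is ..._1.png); the upper bound of 200 is deliberately generous.
for i in range(1, 201):
    os.system("wget http://output.nsfc.gov.cn/report/10/10274071_%i.png" % i)

# Sort the downloaded files in natural order (1, 2, ..., 10, 11, ...)
files = natsort.natsorted(os.listdir('./'))

# Place each PNG on its own A4 portrait page (210 mm x 297 mm)
pdf = FPDF()
for image in files:
    if image.endswith('.png'):
        pdf.add_page()
        pdf.image(image, 0, 0, 210, 297)
pdf.output("../颗粒物质基本性质的研究.pdf", "F")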

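This approach has two drawbacks:

The total number of pages has to be looked up by hand, so the download loop should be given a generously large upper bound (200 here);
Some pages of the report are in landscape orientation and are not rotated before merging.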
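
Both drawbacks could probably be worked around. The sketch below is only one possible approach, not part of the original script. It stops downloading at the first page that is missing, and it reads each PNG's pixel size so that landscape pages can be placed on landscape PDF pages (a different technique from rotating the image). The helper names (fetch, download_pages, png_size, page_layout) are hypothetical, and the sketch assumes that the server answers a request for a nonexistent page with an HTTP error or an empty body, which I have not verified.

```python
import struct
import urllib.error
import urllib.request
from pathlib import Path

URL_TEMPLATE = "http://output.nsfc.gov.cn/report/10/10274071_{page}.png"


def fetch(url, timeout=30):
    """Return the response body, or None if the server replies with an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        return None


def download_pages(url_template, out_dir, fetch_page=fetch, max_pages=500):
    """Save pages 1, 2, ... as PNG files and stop at the first missing page."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for page in range(1, max_pages + 1):
        data = fetch_page(url_template.format(page=page))
        if not data:  # assumption: a missing page gives an HTTP error or an empty body
            break
        path = out_dir / f"{page}.png"
        path.write_bytes(data)
        saved.append(path)
    return saved


def png_size(path):
    """Read (width, height) in pixels from the IHDR chunk of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:8] != b"\x89PNG\r\n\x1a\n" or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a PNG file")
    return struct.unpack(">II", header[16:24])


def page_layout(width, height):
    """Return (orientation, page_width_mm, page_height_mm) for an A4 page."""
    return ("L", 297, 210) if width > height else ("P", 210, 297)
```

When merging, the layout could then be applied page by page: orientation, w, h = page_layout(*png_size(path)), followed by pdf.add_page(orientation=orientation) and pdf.image(str(path), 0, 0, w, h). Since download_pages returns the files in page order, no natural sort is needed afterwards.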

2. Graduate theses

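Theses from some universities cannot be downloaded directly from databases such as CNKI (中国知网), and my own university's online preview is not very convenient either. The original page loads the thesis content (in fact JPG images) dynamically with JavaScript, so simple tools such as wget or urllib cannot fetch it directly. Selenium is therefore used to drive a Chrome browser and simulate mouse operations, downloading the thesis page by page.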

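# Main reference: https://www.devdungeon.com/content/grab-image-clipboard-python-pillow
from time import sleep

import pyautogui
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Download step
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory': '/user/defined/path'}
chrome_options.add_experimental_option('prefs', prefs)
# Selenium 4 style: the driver path goes through Service, the options through `options`
driver = webdriver.Chrome(service=Service("./geckodriver/chromedriver"), options=chrome_options)
## How to find this URL is described after the code
driver.get("https://thesis.lib.pku.edu.cn/onlinePDF?dbid=72&objid=53_57_54_50_56_49&flag=online")
# Read the first page image's address once, while the preview page is still loaded,
# and drop its last 10 characters (presumably a suffix of the form "_00001.jpg")
img_url = driver.find_element(By.ID, 'ViewContainer_BG_0').get_attribute('src')[:-10]
for t in range(1, 3):  # pages 1-2 only; raise the upper bound to the thesis's page count
    driver.get(img_url + "_{:05d}.jpg".format(t))
    # Move to the given screen position and right-click the image
    pyautogui.rightClick(x=600, y=500)
    # Press V (presumably the context-menu shortcut for saving the image), then Ctrl+V
    pyautogui.typewrite(['V'])
    pyautogui.hotkey('ctrlleft', 'V')
    sleep(0.8)
    pyautogui.press('enter')
    sleep(0.8)
    pyautogui.press('enter')
driver.close()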

# Image adjustment step
## Some images are in landscape orientation; rotate them 90 degrees counterclockwise
import os
import natsort
from fpdf import FPDF
import cv2
files = natsort.natsorted(os.listdir('/DOWNLOAD_PATH/'))
for file in files:
    if file.endswith('.jpg'):
        img = cv2.imread('/DOWNLOAD_PATH/' + file)
        h, w, c = img.shape
        if h < w:
            imgrot = cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE)
            cv2.imwrite('/DOWNLOAD_PATH/' + file, imgrot)

# Merge into a PDF
pdf = FPDF()
for file in files:
    if file.endswith('.jpg'):
        pdf.add_page()
        pdf.image('/DOWNLOAD_PATH/' + file, 0, 0, 210, 297)
pdf.output("FILENAME.pdf", "F")
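
The download loop above relies on fixed 0.8 s pauses, which may be too short on a slow connection. As a rough sketch only, a small polling helper could wait until the saved file actually appears before moving on to the next page. The function name wait_for_file is hypothetical, and the sketch assumes that the name of the saved file is known in advance (for example, the last segment of the image URL inside the download directory) and that a non-empty file means the save has finished.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout=10.0, poll=0.2):
    """Poll until path exists and is non-empty.

    Returns True as soon as the file is there, False once timeout seconds have passed.
    """
    path = Path(path)
    deadline = time.monotonic() + timeout
    while True:
        if path.is_file() and path.stat().st_size > 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Inside the loop, a call such as wait_for_file('/user/defined/path/' + expected_name) could then replace the final sleep, with a skipped or retried page when it returns False. A stricter variant might also wait until the file size stops changing.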

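After searching the Peking University Library catalogue, click the target document, open Chrome DevTools on that page, and locate the field marked in the accompanying screenshot; right-click it and choose "copy link element" to obtain the preview-page URL.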
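
To make the URL handling in the download script easier to follow, here is a minimal sketch of how the per-page image addresses are built from the src of the first image. The function name page_image_urls and the example address are hypothetical; the 10-character suffix and the 5-digit zero padding are taken from the script and assume that the first image's address ends in something like _00001.jpg.

```python
def page_image_urls(first_src, n_pages, suffix_len=10, digits=5):
    """Build the image URL of every page from the first page's src attribute.

    The last suffix_len characters of first_src (e.g. "_00001.jpg") are dropped,
    then a zero-padded page number and ".jpg" are appended for each page.
    """
    base = first_src[:-suffix_len]
    return [f"{base}_{page:0{digits}d}.jpg" for page in range(1, n_pages + 1)]


# Hypothetical address, for illustration only
urls = page_image_urls("https://example.org/pages/abc_00001.jpg", 3)
```

Building the whole list up front also makes it clear why the src only needs to be read once, before the browser navigates away from the preview page.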

#Python
