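Are university libraries abroad like the ones at home? Scraping journal subject headings with a Python script - Alibaba Cloud Developer Community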


2024-05-20



import argparse
import csv
import re
import time
from urllib.request import urlopen

from bs4 import BeautifulSoup
from tqdm import tqdm

# Parameters for each supported online catalog.
# 'catalog name' : {
#     'base_url'     : beginning part of the URL, from 'http://' to just before the first '/',
#     'search_url'   : URL for the online catalog search, without the base URL but including the leading '/';
#                      make sure that '{0}' is in the proper place for the ISSN query,
#     'search_title' : CSS selector for the parent element of the anchor
#                      containing the journal title in the HTML search results,
#     'bib_record'   : CSS selector for the record metadata on the catalog item's HTML page,
#     'bib_title'    : CSS selector for the parent element of the anchor containing the journal title,
#     'bib_subjects' : HTML tag name of the table element whose text begins with
#                      "Topics" or "Subject" on the catalog item's HTML page, within bib_record
# }
catalogs = {
    'worldcat': {
        'base_url': "https://www.worldcat.org",
        'search_url': "/search?qt=worldcat_org_all&q={0}",
        'search_title': ".result.details .name",
        'bib_record': "div#bibdata",
        'bib_title': "div#bibdata h1.title",
        'bib_subjects': "th"
    },
    'carli_i-share': {
        'base_url': "https://vufind.carli.illinois.edu",
        'search_url': "/all/vf-sie/Search/Home?lookfor={0}&type=isn&start_over=0&submit=Find&search=new",
        'search_title': ".result .resultitem",
        'bib_record': ".record table.citation",
        'bib_title': ".record h1",
        'bib_subjects': "th"
    },
    'mobius': {
        'base_url': 'https://searchmobius.org',
        'search_url': "/iii/encore/search/C__S{0}%20__Orightresult__U?lang=eng&suite=cobalt",
        'search_title': ".dpBibTitle .title",
        'bib_record': "table#bibInfoDetails",
        'bib_title': "div#bibTitle",
        'bib_subjects': "td"
    }
}
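To make the roles of `base_url` and `search_url` concrete, here is a minimal sketch of how a search URL could be assembled from one catalog entry. The `build_search_url` helper and the sample ISSN are illustrative assumptions and do not appear in the original script; the two values are copied from the `worldcat` entry.

```python
# Hypothetical helper (not in the original script): builds the search URL
# for one catalog entry by substituting the ISSN into the '{0}' placeholder.
WORLDCAT = {
    'base_url': "https://www.worldcat.org",
    'search_url': "/search?qt=worldcat_org_all&q={0}",
}


def build_search_url(issn, params):
    # base URL + search path, with the ISSN filled into '{0}'
    return params['base_url'] + params['search_url'].format(issn)


# Sample ISSN chosen only for illustration.
print(build_search_url("0028-0836", WORLDCAT))
```

Keeping the placeholder inside `search_url` means a new catalog can be supported by adding a dictionary entry, without touching the search code.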
# Obtain the right parameters for a specific catalog system
# Input: catalog name: 'worldcat', 'carli_i-share', 'mobius'
# Output: dictionary of catalog parameters (None if the name is unknown)
def get_catalog_params(catalog_key):
    try:
        return catalogs[catalog_key]
    except KeyError:
        print('Error - unknown catalog %s' % catalog_key)
        return None
# Search catalog for item by ISSN
# Input: ISSN, catalog parameters
# Output: full URL for catalog item (None if the search fails)
def search_catalog(issn, p=catalogs['carli_i-share']):
    title_url = None
    # catalog URL for searching by ISSN
    url = p['base_url'] + p['search_url'].format(issn)
    u = urlopen(url)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()
    try:
        soup = BeautifulSoup(html, features="html.parser")
        # first search result; its anchor links to the catalog item page
        title = soup.select(p['search_title'])[0]
        title_url = title.find("a")['href']
    except Exception:
        print('Error - unable to search catalog by ISSN')
        return title_url
    return p['base_url'] + title_url
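The function builds the item URL by concatenating `base_url` with the anchor's `href`, which works as long as the catalog returns site-relative links. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves relative links and leaves absolute ones unchanged. The `resolve` helper and both `href` values below are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://www.worldcat.org"

# Hypothetical href values, as they might appear in a search result anchor.
relative_href = "/title/example-journal/oclc/123"
absolute_href = "https://www.worldcat.org/title/example-journal/oclc/123"


def resolve(base, href):
    # urljoin keeps an absolute href as-is and resolves a relative one against base
    return urljoin(base, href)


print(resolve(base_url, relative_href))
print(resolve(base_url, absolute_href))
```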
# Scrape catalog item URL for metadata
# Input: full URL, catalog parameters
# Output: dictionary of catalog item metadata,
#         including title and subjects
def scrape_catalog_item(url, p=catalogs['carli_i-share']):
    result = {'title': None, 'subjects': None}
    u = urlopen(url)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()
    try:
        soup = BeautifulSoup(html, features="html.parser")

        # title
        try:
            title = soup.select_one(p['bib_title']).contents[0].strip()
            # save title to result dictionary
            result["title"] = title
        except Exception:
            print('Error - unable to scrape title from url')

        # subjects
        try:
            record = soup.select_one(p['bib_record'])
            # header cell whose text contains "Subject(s)" or "Topic(s)"
            subject = record.find_all(p['bib_subjects'], string=re.compile(r"Subjects?|Topics?"))[0]
            subject_header_row = subject.parent
            subject_anchors = subject_header_row.find_all("a")
            subjects = []
            for anchor in subject_anchors:
                subjects.append(anchor.get_text(strip=True))
            # save subjects to result dictionary
            result["subjects"] = subjects
        except Exception:
            print('Error - unable to scrape subjects from url')
    except Exception:
        print('Error - unable to scrape url')
    return result
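The subject lookup relies on a regular expression to find the header cell in the record table. The sketch below shows which header texts a pattern of this kind accepts. The sample header strings are assumptions, and the sketch uses `search`, on the understanding that BeautifulSoup applies a compiled pattern passed as `string=` the same way, so a match anywhere in the cell text counts.

```python
import re

# Accepts "Subject", "Subjects", "Topic" and "Topics" anywhere in the text.
header_pattern = re.compile(r"Subjects?|Topics?")

# Hypothetical header cell texts from a catalog record page.
headers = ["Subjects:", "Topics", "Subject", "Author", "Publisher:"]

# Keep only the header texts the pattern finds a match in.
matching = [h for h in headers if header_pattern.search(h)]
print(matching)
```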

# Search for catalog item and process metadata from item's HTML page
# Input: ISSN, catalog parameters
# Output: dictionary of values: issn, catalog url, title, subjects
def get_issn_data(issn, p=catalogs['carli_i-share']):
    results = {'issn': issn, 'url': None, 'title': None, 'subjects': None}
    time.sleep(time_delay)  # pause before each request
    url = search_catalog(issn, p)
    results['url'] = url
    if url:  # only parse metadata for valid URL
        time.sleep(time_delay)
        item_data = scrape_catalog_item(url, p)
        results['title'] = item_data['title']
        if item_data['subjects'] is not None:
            results['subjects'] = ','.join(item_data['subjects']).replace(', -', ' - ')
    return results

# main loop to parse all journals
time_delay = 0.5  # time delay in seconds to prevent Denial of Service (DoS)
if __name__ == '__main__':
    # set up arguments for command line
    parser = argparse.ArgumentParser(description='Scrape out metadata from online catalogs for an ISSN')
    parser.add_argument('catalog', type=str, choices=('worldcat', 'carli_i-share', 'mobius'), help='Catalog name')
    parser.add_argument('-b', '--batch', nargs=1, metavar='Input CSV', help='Run in batch mode - processing multiple ISSNs')
    parser.add_argument('-s', '--single', nargs=1, metavar='ISSN', help='Run for single ISSN')
    args = parser.parse_args()
    params = get_catalog_params(args.catalog)  # catalog parameters

    # single ISSN
    if args.single is not None:
        issn = args.single[0]
        r = get_issn_data(issn, params)
        print('ISSN: {0}\r\nURL: {1}\r\nTitle: {2}\r\nSubjects: {3}'.format(r['issn'], r['url'], r['title'], r['subjects']))

    # multiple ISSNs
    elif args.batch is not None:
        input_filename = args.batch[0]
        output_filename = 'batch_output_{0}.csv'.format(args.catalog)  # put name of catalog at end of output file
        with open(input_filename, mode='r') as csv_input, open(output_filename, mode='w', newline='', encoding='utf-8') as csv_output:
            read_in = csv.reader(csv_input, delimiter=',')
            write_out = csv.writer(csv_output, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            write_out.writerow(['ISSN', 'URL', 'Title', 'Subjects'])  # write out headers to output file
            total_rows = sum(1 for row in read_in)  # read all rows to get total
            csv_input.seek(0)  # move back to beginning of file
            read_in = csv.reader(csv_input, delimiter=',')  # reload csv reader object
            for row in tqdm(read_in, total=total_rows):  # tqdm is progress bar
                # each row is an ISSN
                issn = row[0]
                r = get_issn_data(issn, params)
                write_out.writerow([r['issn'], r['url'], r['title'], r['subjects']])
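Batch mode reads the input file twice: once to count the rows for the progress bar, and again to process each ISSN. The following sketch replays that pattern on in-memory data so it can be run without network access. `fake_lookup` is a hypothetical stand-in for `get_issn_data`, and the ISSNs are made up.

```python
import csv
import io


def fake_lookup(issn):
    # Stand-in for get_issn_data: returns placeholder values, no network access.
    return {'issn': issn, 'url': None, 'title': 'Title for ' + issn, 'subjects': None}


csv_input = io.StringIO("1234-5678\n2345-6789\n")  # made-up ISSNs, one per row
csv_output = io.StringIO()

read_in = csv.reader(csv_input)
total_rows = sum(1 for _ in read_in)  # first pass: count rows
csv_input.seek(0)                     # rewind before the second pass
read_in = csv.reader(csv_input)       # fresh reader over the rewound input

write_out = csv.writer(csv_output, quoting=csv.QUOTE_MINIMAL)
write_out.writerow(['ISSN', 'URL', 'Title', 'Subjects'])  # header row
for row in read_in:
    r = fake_lookup(row[0])  # first column holds the ISSN
    write_out.writerow([r['issn'], r['url'], r['title'], r['subjects']])

print(total_rows)
print(csv_output.getvalue())
```

Missing values (`None`) are written as empty cells, so a failed lookup still produces a row that carries the ISSN.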

