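Hello, I am trying to get all the /pubmed/ numbers that link to the abstracts of articles by a specific author. The problem is that when I try this, I just get the same number over and over until the for loop ends.
The href I want should be taken from the output of the "for line in lines" loop (the specific href is shown in the output example). That loop seems to work well, but then the "for abstract in abstracts" loop only repeats the same href.
Any suggestions or ideas about what I am missing or doing wrong? I don't have much experience with bs4, so I may not be using the library well.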
#Obtain all the papers of a scientific author and write its abstract in a new file
from bs4 import BeautifulSoup
import re
import requests

url="https://www.ncbi.nlm.nih.gov/pubmed/?term=valvano"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
lines = soup.find_all("div",{"class": "rslt"})
authors= soup.find_all("p",{"class": "desc"})

scientist=[]
for author in authors:
    #print('\n', author.text)
    scientist.append(author.text)

s=[]
for i in scientist:
    L=i.split(',')
    s.append(L)

n=0
for line in lines:
    if ' Valvano MA' in s[n] or 'Valvano MA' in s[n] :
        print('\n',line.text)
        #part of one output:
        # <a href="/pubmed/32146294" ...
        found = soup.find("a",{"class": "status_icon nohighlight"})
        web_abstract='https://www.ncbi.nlm.nih.gov{}'.format(found['href'])
        response0 = requests.get(web_abstract)
        sopa = BeautifulSoup(response0.content, 'lxml')
        abstracts = sopa.find("div",{"class": "abstr"})
        for abstract in abstracts:
            #print (abstract.text)
            print('https://www.ncbi.nlm.nih.gov{}'.format(found['href']))
        #output:
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        # https://www.ncbi.nlm.nih.gov/pubmed/31919170
        n=n+1
    else:
        n=n+1
#expected output:
https://www.ncbi.nlm.nih.gov/pubmed/32146294
https://www.ncbi.nlm.nih.gov/pubmed/32064693
https://www.ncbi.nlm.nih.gov/pubmed/31978399
https://www.ncbi.nlm.nih.gov/pubmed/31919170
https://www.ncbi.nlm.nih.gov/pubmed/31896348
https://www.ncbi.nlm.nih.gov/pubmed/31866961
https://www.ncbi.nlm.nih.gov/pubmed/31722994
https://www.ncbi.nlm.nih.gov/pubmed/31350337
https://www.ncbi.nlm.nih.gov/pubmed/31332863
https://www.ncbi.nlm.nih.gov/pubmed/31233657
https://www.ncbi.nlm.nih.gov/pubmed/31133642
https://www.ncbi.nlm.nih.gov/pubmed/30913267
Question source: stackoverflow
It doesn't need to be that complicated. Just use:
ids = soup.select('dt + dd')
for i in ids:
    pmid = i.text
    print(f'https://www.ncbi.nlm.nih.gov/pubmed/{pmid}')
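To illustrate what the 'dt + dd' selector picks up, here is a minimal offline sketch. The HTML fragment is a hypothetical stand-in for the result markup (it assumes each result carries a "PMID:" dt followed directly by a dd holding the number), and pubmed_urls is an illustrative helper name, not part of bs4. It uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the result markup (an assumption,
# not copied from the real page).
SAMPLE_HTML = """
<div class="rslt">
  <p class="title"><a href="/pubmed/32146294">First paper</a></p>
  <dl class="rprtid"><dt>PMID:</dt> <dd>32146294</dd></dl>
</div>
<div class="rslt">
  <p class="title"><a href="/pubmed/32064693">Second paper</a></p>
  <dl class="rprtid"><dt>PMID:</dt> <dd>32064693</dd></dl>
</div>
"""


def pubmed_urls(html):
    """Build one abstract URL per PMID found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # 'dt + dd' is the CSS adjacent-sibling combinator: it matches every
    # <dd> whose immediately preceding sibling element is a <dt>.
    return [
        f"https://www.ncbi.nlm.nih.gov/pubmed/{dd.get_text(strip=True)}"
        for dd in soup.select("dt + dd")
    ]


urls = pubmed_urls(SAMPLE_HTML)
for url in urls:
    print(url)
```

Because select returns every match in document order, each result contributes its own number, so no per-result counter is needed.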
Answer source: stackoverflow
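As a side note on why the question's loop printed one href repeatedly: soup.find(...) searches the whole document and returns the first match on every iteration, regardless of which result the loop is currently on. Calling find on the loop variable restricts the search to that result. The sketch below shows the difference on a hypothetical fragment. It assumes each div.rslt contains its own link, and it looks up the first a tag with an href instead of the "status_icon nohighlight" class, since that class may not be present on every result.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three results, each with its own link (an assumption).
SAMPLE_HTML = """
<div class="rslt"><a href="/pubmed/32146294">Paper A</a></div>
<div class="rslt"><a href="/pubmed/32064693">Paper B</a></div>
<div class="rslt"><a href="/pubmed/31978399">Paper C</a></div>
"""

BASE = "https://www.ncbi.nlm.nih.gov"
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
lines = soup.find_all("div", {"class": "rslt"})

# Searching from the document root: the same first link on every pass.
repeated = [BASE + soup.find("a", href=True)["href"] for _ in lines]

# Searching from the current result: one distinct link per result.
distinct = [BASE + line.find("a", href=True)["href"] for line in lines]

print(repeated)
print(distinct)
```

In the question's code, the equivalent change would be replacing soup.find(...) with line.find(...) inside the "for line in lines" loop, and checking the result for None before indexing it.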