I'm trying to format the entire output of my Beautiful Soup web scraper. The output looks like this:
AT-FVFX1BN7J1WK:Python 522672$ /Library/Frameworks/Python.framework/Versions/3.7/bin/python3 "/Users/522672/Desktop/Python/Scraper/Beautiful Soup/Python2.py"
[<div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="@-yet">@-yet</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ADDI-DATA">ADDI-DATA</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ADE-Werk">ADE-Werk</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="Adelmann Umwelt">Adelmann Umwelt</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="Ademco 1">Ademco 1</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="adesso">adesso</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ADITO Software">ADITO Software</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ADMOS Gleitlager">ADMOS Gleitlager</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ads-tec Industrial IT">ads-tec Industrial IT</div>
</div>, <div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ADVES">ADVES</div>
</div>]
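This is the raw output I get when I print company_name, but I don't know how to format company_name so that it contains only the company names. When I print company_name, I want to get just a plain list of companies, such as "@-yet" or "ADDI-DATA".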
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://www.vdma.org/en/mitglieder?p_p_lifecycle=2&p_p_resource_id=getPage&p_p_id=vdma2publicusers_WAR_vdma2publicusers&s=&page=5'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
company_name = soup.find_all('div', class_="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left")
company_website = soup.find_all('div', class_="col-xs-5 col-sm-5 col-md-5 col-lg-5 text-right")
company_adress = soup.find_all('div', class_="col-xs-5 col-sm-5 col-md-5 col-lg-5")
company_contact = soup.find_all('div', class_="col-xs-10 col-sm-10 col-md-9 col-lg-9")
Question source: StackOverflow, /questions/59383246/how-to-format-an-entire-output-list-of-a-beautiful-soup
Try using this CSS selector to get the company names. Updated code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.vdma.org/en/mitglieder?p_p_lifecycle=2&p_p_resource_id=getPage&p_p_id=vdma2publicusers_WAR_vdma2publicusers&s=&page=5'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
with open("test.txt", "w") as output:
    companies = soup.select('.col-lg-10')
    for company in companies:
        company_name = company.select('.text-left')[0].text.strip()
        company_contacts = company.select('.col-lg-9 .ellipsis')
        # If you want to check the type of every contact
        # for contact in company_contacts:
        #     if "@" in contact.text.strip():
        #         print("Contact is email")
        #     else:
        #         print("Contact is a number")
        output.write(f"Name: {company_name}\nContacts: {', '.join([contact.text.strip() for contact in company_contacts])}\n\n")
# Output
# Name: 2W Technische Informations
# Contacts: info@2wgmbh.de, (+49 89) 5 20 35-0
# Name: 3 S Schnecken + Spindeln + Spiralen
# Contacts: office@3s-gmbh.at, (+43 7613) 50 04
# Name: 365FarmNet
# Contacts: info@365farmnet.com, (+49 30) 2 59 32 95 00, (+49 30) 2 59 32 95 01
# Name: 3D Interaction Technologies
# Contacts: info@3dit.de, (+49 351) 21 96-74 95
# ...
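If only the names are needed, the original find_all result from the question can probably be reduced to plain strings directly. The following is a minimal sketch, not the answerer's method: it parses a small inline HTML sample copied from the output shown in the question (so it runs without a network request) and assumes the live page uses the same markup. The variable names name_divs and company_names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the first three entries of the output in the question.
# Assumption: the live page returns the same structure.
html = """
<div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="@-yet">@-yet</div>
</div>
<div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ADDI-DATA">ADDI-DATA</div>
</div>
<div class="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left">
<div class="ellipsis" title="ADE-Werk">ADE-Werk</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns Tag objects, which is why printing the list shows raw HTML.
name_divs = soup.find_all("div", class_="col-xs-7 col-sm-7 col-md-7 col-lg-7 text-left")

# get_text(strip=True) drops the tags and surrounding whitespace, leaving only the text.
company_names = [div.get_text(strip=True) for div in name_divs]

print(company_names)
```

With the real page, the same list comprehension would be applied to the soup built from page.content instead of the inline sample.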