An Overview of Several Major Crawler Frameworks
Published: 2019-06-13

This post is about 3,052 characters long and takes roughly 10 minutes to read.

1. pyspider demo (basic usage)

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # entry point: re-scheduled once every 24 hours
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # pages fetched by this callback are considered fresh for 10 days
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow every absolute http(s) link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dict is the task's result, stored by pyspider
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
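In this handler, on_start is the entry point and is re-scheduled by @every once a day; index_page queues every outgoing link it finds, and detail_page returns the URL and title of each page, which pyspider keeps in its result database. The project is normally run and debugged from pyspider's WebUI (started with the pyspider command). If you would rather persist results yourself, pyspider also lets a handler override the on_result hook; the snippet below is a minimal sketch of that idea, assuming the same Handler class as above and a hypothetical results.jsonl output file.

from pyspider.libs.base_handler import *
import json


class Handler(BaseHandler):
    # ... crawl_config, on_start, index_page, detail_page as in the demo above ...

    def on_result(self, result):
        # on_result receives the dict returned by each callback (here: detail_page)
        if not result:
            return
        # append each result as one JSON line; 'results.jsonl' is a hypothetical path
        with open('results.jsonl', 'a', encoding='utf-8') as f:
            f.write(json.dumps(result, ensure_ascii=False) + '\n')
        # forward to the default handler so pyspider's own resultdb still receives the result
        super(Handler, self).on_result(result)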

2. newspaper

newspaper is mainly intended for text and article analysis; it is commonly used to extract article-style content (title, body, authors, and so on).

>>> from newspaper import Article

>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)

>>> article.download()
>>> article.html

>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

>>> article.text
'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]

>>> article.nlp()
>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'
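The same calls can of course be put into a plain script instead of an interactive session. The following is a minimal sketch that strings together the steps shown above (download, parse, nlp); the URL is the example URL from above, and nlp() assumes the NLTK data that newspaper asks you to install is available.

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)

article.download()   # fetch the raw HTML
article.parse()      # extract title, authors, publish date, text, images, ...
article.nlp()        # keyword and summary extraction; must run after parse()

print(article.title)
print(article.authors)
print(article.publish_date)
print(article.keywords)
print(article.summary)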
>>> import newspaper
>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
>>>     print(article.url)
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
>>>     print(category)
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...

>>> cnn_article = cnn_paper.articles[0]
>>> cnn_article.download()
>>> cnn_article.parse()
>>> cnn_article.nlp()
...
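A practical note that is not in the original post but matches newspaper's documented behaviour: newspaper.build() caches the articles it has already seen, so building the same source twice may return far fewer articles the second time. Passing memoize_articles=False disables that cache, as in this small sketch:

import newspaper

# disable the "seen articles" cache so every build returns the full article list
cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
print(cnn_paper.size())   # number of articles found in this build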
>>> import requests
>>> from newspaper import fulltext

>>> html = requests.get(...).text
>>> text = fulltext(html)

Newspaper can extract and detect languages seamlessly. If no language is specified, Newspaper will attempt to auto detect a language.

>>> from newspaper import Article
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'

>>> a = Article(url, language='zh')  # Chinese

>>> a.download()
>>> a.parse()

>>> print(a.text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建
筑(僭建)问题到立法会接受质询,并向香港民众道歉。
梁振英在星期二(12月10日)的答问大会开始之际
在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的
意图和动机。 一些亲北京阵营议员欢迎梁振英道歉,
且认为应能获得香港民众接受,但这些议员也质问梁振英有

>>> print(a.title)
港特首梁振英就住宅违建事件道歉

If you are certain that an entire news source is in one language, go ahead and use the same api :)

>>> import newspaper
>>> sina_paper = newspaper.build('http://www.sina.com.cn/', language='zh')

>>> for category in sina_paper.category_urls():
>>>     print(category)
http://health.sina.com.cn
http://eladies.sina.com.cn
http://english.sina.com
...

>>> article = sina_paper.articles[0]
>>> article.download()
>>> article.parse()

>>> print(article.text)
新浪武汉汽车综合 随着汽车市场的日趋成熟,
传统的“集全家之力抱得爱车归”的全额购车模式已然过时,
另一种轻松的新兴车模式――金融购车正逐步成为时下消费者购
买爱车最为时尚的消费理念,他们认为,这种新颖的购车
模式既能在短期内
...

>>> print(article.title)
两年双免0手续0利率 科鲁兹掀背金融轻松购_武汉车市_武汉汽
车网_新浪汽车_新浪网
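If you are not sure which codes the language= argument accepts, newspaper can list its supported languages. This small sketch relies on the library's documented helper; the output is a table of language codes such as zh, en, ar, and so on.

import newspaper

# prints a table of the supported language codes (e.g. ar, zh, en, ...)
newspaper.languages()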

Reposted from: https://www.cnblogs.com/jiabotao/p/10432316.html
