# CSDN Spider
> Main feature: crawl all blog posts of the specified CSDN blog users, convert them to Markdown, and save them locally.
## Download the script
```shell
git clone https://github.com/ds19991999/csdn-spider.git
cd csdn-spider
python3 -m pip install -r requirements.txt
# Test
python3 test.py  # requires the login cookie to be configured first
```
## Get the cookie
Log in to your `csdn` account, open https://blog.csdn.net, press `F12` to open the browser developer tools, copy all of the `Request Headers`, and save them to a file named `cookie.txt`.
![1571482112632](assets/1571482112632.png)
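
To check that the saved file is usable, the copied headers can be parsed into a dictionary. The following is only a minimal sketch: it assumes `cookie.txt` holds one `Name: value` pair per line, and `load_headers` is a hypothetical helper, not a function provided by this script.

```python
from pathlib import Path


def load_headers(cookie_path):
    """Parse 'Name: value' lines copied from the browser into a dict.

    Assumption: one header per line. The exact format the spider itself
    expects is not documented here.
    """
    headers = {}
    for line in Path(cookie_path).read_text(encoding="utf-8").splitlines():
        name, sep, value = line.strip().partition(":")
        name = name.strip()
        # Skip blank lines, request lines such as "GET / HTTP/1.1", and
        # HTTP/2 pseudo-headers such as ":authority: blog.csdn.net".
        if not sep or not name or " " in name:
            continue
        headers[name] = value.strip()
    return headers
```

If the resulting dictionary has no `Cookie` entry (header names may be lower-case depending on the browser), the login cookie was probably not copied.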
## Crawl all posts of a user
```python
import csdn
csdn.spider(["ds19991999", "u013088062"], "cookie.txt", 5)
# Parameters: usernames: list, cookie_path: str, thread_num: int = 10, folder_name: str = "articles"
```
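To give a rough idea of what `thread_num` and `folder_name` control, here is a minimal sketch of a thread pool that processes several users and writes the results into an output folder. It is not the script's actual implementation: `crawl_users` and `fetch_markdown` are hypothetical names, and `fetch_markdown` is a stub standing in for the real download-and-convert step.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_markdown(username):
    """Hypothetical stub for the real download-and-convert step."""
    return f"# Posts by {username}\n"


def crawl_users(usernames, thread_num=10, folder_name="articles"):
    """Process each user in a pool of worker threads and save the results."""
    out_dir = Path(folder_name)
    out_dir.mkdir(parents=True, exist_ok=True)

    def work(username):
        # Each worker writes one Markdown file per user.
        path = out_dir / f"{username}.md"
        path.write_text(fetch_markdown(username), encoding="utf-8")
        return path

    with ThreadPoolExecutor(max_workers=thread_num) as pool:
        # map() returns results in the same order as the input list.
        return list(pool.map(work, usernames))
```

A smaller `thread_num` means fewer simultaneous requests, which is usually gentler on the target site.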
## LICENSE
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

`PS`: This is a casually written crawler script; it is updated only occasionally.