commit demo
parent 604709b719
commit db79d3c171

README.md
@@ -1 +1,62 @@
# CSDN Crawler Script

Main feature: crawl all blog posts of a specified `csdn` user, convert them to `markdown`, and save them locally.

## 1. Runtime environment

You need to install a `WebDriver`: download the `chrome` driver matching your local browser from https://chromedriver.chromium.org/downloads , then add it to your `$PATH`.
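A minimal install sketch for Linux (the release number below is only an example; pick the build matching your installed Chrome):

```shell
# illustrative version; check the downloads page for the right build
wget https://chromedriver.storage.googleapis.com/77.0.3865.40/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/local/bin/   # /usr/local/bin is normally already on $PATH
chromedriver --version                 # confirm the driver is found
```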
Install the Python dependencies:

```shell
# requires Python 3
python3 -m pip install -r requirements.txt
```

## 2. Get the script

```shell
git clone https://github.com/ds19991999/csdn-spider.git
```

## 3. Usage

### 1. Get a cookie

Log in to your `csdn` account and open https://blog.csdn.net , press `F12` to open the browser's developer tools, copy all of the `Request Headers`, and save them into the `cookie.txt` file.

![1571482112632](assets/1571482112632.png)

### 2. Add the `csdn` users to crawl

Add usernames to `username.txt`, one per line.
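For example:

```shell
echo "ds19991999" >> username.txt
```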
### 3. Run the script

```shell
python3 csdn.py
```

## 4. Results

**Run log**

![1571483423256](assets/1571483423256.png)

**Generated article index**: `./articles/username/README.md`

![1571483552438](assets/1571483552438.png)

**Crawled posts**: `./articles/username/`

![1571483479356](assets/1571483479356.png)

**Conversion result**:

![1571483777703](assets/1571483777703.png)

## 5. LICENSE

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>

`PS`: a casually written crawler script; it gets updated whenever the mood strikes.

@@ -0,0 +1,186 @@

# 1. Original: Quick manual install of JupyterLab on Debian, with HTTPS

A long while ago I wrote a very detailed `Jupyter lab` installation tutorial ([link](https://www.creat.kim/archives/25/)). It felt a bit over-complicated, and the `nginx` part in particular wasn't written clearly, so quite a few people ended up unable to run `python`, although following that tutorial step by step does work. Also, being able to reach the site over `https` doesn't guarantee code will run, because `jupyter lab` talks over `websocket`, not plain `http`. So here is a simplified run-through: install `Jupyter Lab` on a `Debian` system and put `https` in front of it with `caddy`; I've verified that programs actually execute. This tutorial covers only the `Python2` kernel; to add `Python3` as well, see the [original post](https://www.creat.kim/archives/25/). The steps are kept brief so you can get through them quickly and succeed in one pass; it really doesn't take long. Demo: [https://jupyter.creat.kim](https://jupyter.creat.kim)<br/>
<img alt="" src="http://image.creat.kim/picgo/20190326142651.png"/><br/>
<img alt="" src="http://image.creat.kim/picgo/20190326151655.png"/>

```
sudo apt-get install software-properties-common
```

## Install the `Python` environment

```
sudo apt-get install python-pip python-dev build-essential
sudo pip install --upgrade pip
sudo pip install --upgrade virtualenv
sudo apt-get install python-setuptools python-dev build-essential
sudo easy_install pip
sudo pip install --upgrade virtualenv
sudo apt-get install python3-pip
sudo apt-get install python-pip
sudo pip3 install --upgrade pip
sudo pip2 install --upgrade pip
sudo pip install --upgrade pip
```

## Check where `pip` points

```
~ $which pip
/usr/local/bin/pip
21:36 alien@alien-Inspiron-3443:
~ $which pip2
/usr/local/bin/pip2
21:36 alien@alien-Inspiron-3443:
~ $which pip3
/usr/local/bin/pip3
```

## Install `yarn`

```
curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
sudo apt-get update
sudo apt-get install yarn
```

## Install `nodejs`

```
curl -sL https://deb.nodesource.com/setup_10.x | bash -
apt-get install -y nodejs
```
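Before building JupyterLab's frontend assets later, it doesn't hurt to confirm both toolchains are visible:

```
node -v
yarn --version
```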

## Install `jupyterlab`

```
sudo pip2 install jupyterlab
```

## Configure `jupyterlab`

```
jupyter-notebook password
```

Then go into `ipython` to generate a hashed password. The password you type here is the one you'll use to log in to `jupyter lab`; note down the hash it produces.

```
ipython
from notebook.auth import passwd
passwd()
# type the password you want for the JupyterLab login page;
# it produces a hash like the one below, save it for later
'sha1:b92f3fb7d848:a5d40ab2e26aa3b296ae1faa17aa34d3df351704'
```

## Edit the config file

It is usually at `/root/.jupyter/jupyter_notebook_config.py`; find and change the following options.

```
c.NotebookApp.allow_root = True
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.notebook_dir = u'/root/JupyterLab'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:b92f3fb7d848:a5d40ab2e26aa3b296ae1faa17aa34d3df351704'
c.NotebookApp.port = 8888

# What each line above does:
# allow running jupyterlab as root
# allow access from any IP range
# root directory shown in the jupyterlab UI
# don't open a browser on startup (a server only has a terminal, after all)
# the hashed password generated earlier
# the port to serve on; must match the caddy config below
```
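If that config file doesn't exist yet, generate it first (a standard jupyter command; the path assumes you run it as root):

```
jupyter notebook --generate-config
# writes /root/.jupyter/jupyter_notebook_config.py
```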

## Run `Jupyter Lab`

```
jupyter-lab --version
jupyter lab build

mkdir ~/JupyterLab
cd ~/JupyterLab

# screen makes backgrounding easy
apt install screen
screen -S jupyterlab
jupyter lab
```

Press `ctrl+A+D` to detach from the screen window.
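To get back to the session later, reattach (standard screen usage):

```
screen -r jupyterlab
```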

## HTTPS reverse proxy with `caddy`

Change the domain to your own; for details on `caddy` see [this post](https://www.creat.kim/archives/18/)

```
wget -N --no-check-certificate https://raw.githubusercontent.com/ds19991999/shell.sh/shell/caddy_install.sh && chmod +x caddy_install.sh && bash caddy_install.sh

echo "jupyter.creat.kim
gzip
tls cva.engineer.ding@gmail.com
proxy / 127.0.0.1:8888 {
    transparent
    websocket
}" > /usr/local/caddy/Caddyfile
```
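Once caddy is up (however the install script manages the service), a quick check from any machine confirms the proxy and certificate are answering; the domain here is the demo one from above:

```
curl -I https://jupyter.creat.kim
```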

## Scheduled backups to `GitHub`

See this rather detailed write-up: [link](https://www.moerats.com/archives/858/)

## Configure the `python2` and `python3` kernels

Might as well see this through to the end, since plenty of people trip up here... When installing packages with `pip`, never use `pip3 install ***` or `pip2 install ***`; go through the interpreter instead:

```
python2 -m pip install ipykernel ipython matplotlib scipy pandas numpy
python3 -m pip install ipykernel ipython matplotlib scipy pandas numpy
```
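If a kernel still doesn't appear afterwards, you can register it explicitly (standard ipykernel usage; the names shown are just examples):

```
python2 -m ipykernel install --name python2 --display-name "Python 2"
python3 -m ipykernel install --name python3 --display-name "Python 3"
```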

Check the kernels:

```
root@google:~/JupyterLab# jupyter kernelspec list
Available kernels:
  python2    /usr/local/share/jupyter/kernels/python2
  python3    /usr/local/share/jupyter/kernels/python3
```

Done. Visit your domain and start using it.

---

## A few parting thoughts

This is probably the last post I'll publish on `CSDN`; it comes from [https://www.creat.kim/archives/40/](https://www.creat.kim/archives/40/). Not bad: I've finally left the public blogging platforms behind. I wrote on `CSDN` for about a year and a half: `107` posts in all, of which `97` were "original" (read: cribbed), `7` reposts, `2` private, and `1` set private by an administrator for violating some policy ... A `CSDN` rank of `10k+`, `225k+` views, `48` followers: an unremarkable record of unremarkable posts, which probably makes it representative of most people here.

The domestic blogging platforms are all actually decent, and the writing experience on `CSDN` is very good. For a while I kept drifting between my own blog and the public platforms, and somewhere along the way the original point of blogging quietly changed; still, having been through it, I think I understand a few things now.

After trying `WordPress`, `Zhihu`, `Jianshu`, `cnblogs`, `Sina`, `GitHub-Jekyll`, `coding-jekyll`, `hexo`, `Typecho`... I picked up some basics of running a website; at minimum, that anything hosted in China needs an ICP filing ...<br/>
For image hosting, I went from plain copy-paste to `GitHub`+`PicGo`, `Upyun` (filing required), `Qiniu` (filing required), and a self-hosted image bed... and learned a few `CDN` acceleration tricks along the way ...<br/>
For documents, from editing things directly, to `CSDN`'s `MarkDown` editor, `Youdao Note`, `Evernote` (with its separate international and Chinese editions), `GitHub-README`, `GitBook`, `MkDoc`, `Read the Docs`, `Sphinx`, `Docsify`. Practice breeds fluency: once you're fluent, any document can be made to look good, even though to this day I can't use `Vim` ...<br/>
On choosing servers, I learned plenty about the gap between domestic and overseas hosting, and grew ever more exasperated watching an `install` of a package or a `program` crawl along at a few `k` or a few `b` per second. Swap domestic mirrors all you like, they never match the speed of an overseas source. Some sites aren't even blocked, yet can you stand the local speeds? I'm amazed I ever tolerated that turtle-paced network. Only after you've seen and experienced things yourself can you view a problem from another angle, which beats forever consuming pre-filtered information.

Then look at the education benefits abroad. Some say foreign vendors got burned by Chinese users milking the freebies, so they stopped offering educational perks to China. But look at the education deals from the big domestic cloud vendors: servers so cheap even I got envious and hurried to register an account with every one of them. Real-name verification required? Fine, I verify, I upload a photo. A filing required? What, that too? Fine, I file, I upload another photo, and there goes another week. And now there's ongoing monitoring? That's where I draw the line ... Doesn't it resemble a predatory loan: hand over your ID and a nice photo, and in return you get a cheap server. OK, I'm overstating it, ha. Not long ago Google, too, began requiring photo verification for sign-ups from Chinese IPs, for China alone. When it comes to investing in education, we really should learn from abroad ...

After the `12306` incident, the `Lantern` incident, and this or that database leak, where true and false blur into each other: living here, you have no choice but to trade privacy for convenience.

@@ -0,0 +1,33 @@

# 2. Original: Fixing "Waiting for headers" and 404 errors on apt update on a fresh Alibaba Cloud Debian box

First remove the pre-baked Aliyun pip/apt source configuration and the stale package lists:

```
rm -rf /root/.pip /root/.pydistutils.cfg /etc/apt/sources.list.d/sources-aliyun-0.list /etc/apt/sources.list.d/sources-aliyun* /var/lib/apt/lists/*
```

Then put the following into `/etc/apt/sources.list`:

```
deb http://mirrors.cloud.aliyuncs.com/debian/ jessie main contrib non-free
deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie main contrib non-free
deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-proposed-updates main non-free contrib
deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-proposed-updates main non-free contrib
deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-updates main contrib non-free
deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-updates main contrib non-free

## Uncomment the following two lines to add software from the 'backports'
## repository.
##
## N.B. software from this repository may not have been tested as
## extensively as that contained in the main release, although it includes
## newer versions of some applications which may provide useful features.
#deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-backports main contrib non-free
#deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-backports main contrib non-free
```

Then refresh apt:

```
apt-get clean
apt-get update
```

Alibaba Cloud being Alibaba Cloud; you really have to hand it to them!!!

@@ -0,0 +1,3 @@

# 3. Original: Deploying a Netlify CMS backend for a Jekyll blog

### Article contents
@@ -0,0 +1,71 @@

# 4. Original: Applying for a Let's Encrypt wildcard certificate

> github: [https://github.com/Neilpang/acme.sh](https://github.com/Neilpang/acme.sh)

DNS providers supported for issuing Let's Encrypt certificates through acme (those with the most users in China): `cloudxns, dnspod, aliyun, cloudflare, linode, he, digitalocean, namesilo, aws, namecom, freedns, godaddy, yandex`, among others.

### Contents

## Install acme.sh

```
curl https://get.acme.sh | sh
```

`acme.sh` is installed into `~/.acme.sh`; create a `bash` `alias` for convenience: `alias acme.sh=~/.acme.sh/acme.sh`

Certificates issued through `acme.sh` get a `cronjob` created for you automatically: every day at 0:00 it checks all certificates, and any that are close to expiring and need renewal are renewed automatically.

## Verifying domain ownership via DNS

```
acme.sh --issue --dns -d mydomain.com
```

`acme.sh` generates and prints the DNS record to add; you only need to create that `txt` record in your domain management panel.
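Once the record is in place, re-run issuance to complete validation (a sketch; newer acme.sh versions may ask you to confirm manual DNS mode with an extra flag):

```
acme.sh --renew -d mydomain.com
```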

## Getting a `DNS API` key

Get the `DNS API` credentials from your DNS provider; with the `api` configured, the `txt` record above gets added at the provider automatically. For example, Aliyun's `api`: [https://ak-console.aliyun.com/#/accesskey](https://ak-console.aliyun.com/#/accesskey) , then follow the per-provider instructions at [https://github.com/Neilpang/acme.sh/tree/master/dnsapi](https://github.com/Neilpang/acme.sh/tree/master/dnsapi) . For Aliyun it is:

```
export Ali_Key="sdfsdfsdfljlbjkljlkjsdfoiwje"
export Ali_Secret="jlsdflanljkljlfdsaklkjflsa"
acme.sh --issue --dns dns_ali -d example.com -d *.example.com
```

The `*` is what makes it a wildcard. After one run, Ali_Key and Ali_Secret are saved into `~/.acme.sh/account.conf`, and the generated SSL certificate lives under `~/.acme.sh/example.com`
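You can list everything that has been issued as a sanity check (a standard acme.sh command):

```
acme.sh --list
```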

## Installing the certificate

> Details: [copy/install certificate](https://github.com/Neilpang/acme.sh/wiki/%E8%AF%B4%E6%98%8E#3-copy%E5%AE%89%E8%A3%85-%E8%AF%81%E4%B9%A6)

Use the `--installcert` command and point it at the target locations; the certificate files are then copied there. For example:

```
acme.sh --installcert -d <domain>.com \
        --key-file /etc/nginx/ssl/<domain>.key \
        --fullchain-file /etc/nginx/ssl/fullchain.cer \
        --reloadcmd "service nginx force-reload"
```

BaoTa (宝塔) panel users: choose "other certificate" under the SSL options and paste the certificate contents in<br/>
<img alt="" src="http://image.creat.kim/picgo/20190314132922.png"/><br/>
Change the certificate path here<br/>
<img alt="" src="http://image.creat.kim/picgo/20190314132617.png"/><br/>
At present certificates renew automatically after 60 days with no action on your part. That interval may be shortened in the future, but it stays automatic either way, so don't worry about it.

## Updating `acme.sh`

Enable auto-upgrade: `acme.sh --upgrade --auto-upgrade`<br/>
Disable auto-upgrade: `acme.sh --upgrade --auto-upgrade 0`

If you run into problems, see the [wiki](https://github.com/Neilpang/acme.sh/wiki) and the [debug guide](https://github.com/Neilpang/acme.sh/wiki/How-to-debug-acme.sh)

@@ -0,0 +1,181 @@

# 5. Original: Rclone notes

### Contents

## Some basic commands

### Mounting

```
# windows mount command
rclone mount OD:/ H: --cache-dir E:\ODPATH --vfs-cache-mode writes &

# linux mount command
nohup rclone mount GD:/ /root/GDPATH --copy-links --no-gzip-encoding --no-check-certificate --allow-other --allow-non-empty --umask 000 &

# unmount (works generally on linux)
fusermount -qzu /root/GDPATH
# or
fusermount -u /path/to/local/mount

# unmount on windows
umount /path/to/local/mount
```
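Before mounting, it's worth a quick check that the remote responds at all (using the `lsd` subcommand covered below; `GD:` is the remote name from the examples above):

```
rclone lsd GD:
```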

### rclone commands

```
rclone ls
# e.g. rclone ls remote:path [flags]
ls      # recursively list every file in the remote with its size, a bit like tree
lsl     # recursively list every file with size and modification time
lsd     # list only directories, with modification time and entry count

lsf     # list files and directories at the current level only
lsjson  # list files and directories as JSON


rclone copy
# e.g. rclone copy OD:/SOME/PATH GD:/OTHER/PATH
--no-traverse    # when /path/to/src has many files but few change per day, this speeds up transfers
-P               # show live transfer statistics
--max-age 24h    # only transfer files modified within 24 hours (off by default)
rclone copy --max-age 24h --no-traverse /path/to/src remote:/PATH -P

rclone sync
# e.g. rclone sync source:path dest:path [flags]
# test with --dry-run first so you know exactly what will be copied and deleted

rclone delete
# list files larger than 100M
rclone --min-size 100M lsl remote:path
# dry-run the deletion
rclone --dry-run --min-size 100M delete remote:path
# delete
rclone --min-size 100M delete remote:path

# remove a path and everything in it; filters do not apply here, unlike delete
rclone purge

# remove an empty path
rclone rmdir

# remove empty directories under a path
rclone rmdirs

# move files
rclone move
# delete empty source directories after the move
--delete-empty-src-dirs

# check that files in source and destination match
rclone check
# download from both sides and compare on the fly instead of comparing hashes
--download

rclone md5sum
# produce an md5sum file for all files in the path
rclone sha1sum
# produce a sha1sum file for all files in the path
rclone size
# print the total size and number of files under remote:path
--json    # JSON output
rclone version --check    # check for newer versions
rclone cleanup    # empty the remote's trash / prune old file versions

rclone dedupe    # interactively find duplicate files and delete/rename them
--dedupe-mode newest    # non-interactive: delete identical duplicates, keep the newest

rclone cat
# same as on linux

rclone copyto
# copy files from source to dest, skipping files already copied

rclone gendocs output_directory [flags]
# generate rclone's documentation

rclone listremotes    # list every remote in the config file
--long    # show type and name (name only by default)

rclone moveto
# does not transfer unchanged files

rclone cryptcheck /path/to/files encryptedremote:path
# check the integrity of an encrypted remote

rclone about
# show the remote's quota, e.g.
$ rclone about ODA1P1:
Total:   5T
Used:    284.885G
Free:    4.668T
Trashed: 43.141G
--json    # JSON output


rclone mount    # the mount command
# on Windows this additionally needs winfsp
--vfs-cache-mode    # without it, files can only be written sequentially and only sought while reading, so Windows programs can't work on them; the flag turns on the caching layer
# four modes: off|minimal|writes|full; the higher the mode, the more rclone caches locally, at the cost of disk space (default: off)
--vfs-cache-max-age 24h     # cache files modified within 24 hours
--vfs-cache-max-size 10g    # cap the total cache at 10g (may be exceeded)
--cache-dir                 # where to keep the cache
--umask                     # override filesystem permissions
--allow-non-empty           # allow mounting onto a non-empty directory
--allow-other               # allow other users to access the mount
--no-check-certificate      # don't verify the server's SSL certificate
--no-gzip-encoding          # don't request gzip encoding
```

## Re-uploading to gd with your own api

> See this write-up for details: [https://www.moerats.com/archives/877/](https://www.moerats.com/archives/877/)

With so many people using `rclone` there's a problem: everyone shares the same default `client_id`, so at peak times you hit `403`s, or run into `Limitations` before ever reaching the `750G` quota. Anyone who uses `rclone` heavily to move files into Google Drive should therefore use their own `api`. Follow the article above to obtain a Google API client `ID` and client secret; `rclone config` will prompt you for them at the right step, so just paste them in and you're set.
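In the config file the result looks roughly like this (a sketch: the remote name `GD` and the key values are placeholders, and `rclone config` writes this section for you):

```
# ~/.config/rclone/rclone.conf (illustrative)
[GD]
type = drive
client_id = YOUR_CLIENT_ID.apps.googleusercontent.com
client_secret = YOUR_CLIENT_SECRET
scope = drive
```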

The mount command then becomes:

```
# these flags are mainly for uploading
/usr/bin/rclone mount DriveName:Folder LocalFolder \
  --umask 0000 \
  --default-permissions \
  --allow-non-empty \
  --allow-other \
  --transfers 4 \
  --buffer-size 32M \
  --low-level-retries 200

# if you also read from the mount, e.g. playing media through H5ai, add these three flags too (same style as above)
--dir-cache-time 12h
--vfs-read-chunk-size 32M
--vfs-read-chunk-size-limit 1G
```

## Getting past Google Drive's server-side 750g limit

Google explicitly limits third-party `api` transfers to `750G` of server-side copying per day; that `750G` happens entirely on Google's servers, without the data passing through the client. Uploading from a client to `gd` has its own, separate official `750G` limit, so combined you get `1.5T` of upload allowance per day

```
# normal usage: server-side API, no local bandwidth consumed
rclone copy GD1:/PATH GD2:/PATH

# disable server side copies: client API, traffic flows through the client
rclone copy --disable copy GD1:/PATH GD2:/PATH
```

That way it's `1.5T` per day.

## Google Docs limitations

In `rclone ls`, Google Docs show a size of `-1`, while through the `VFS` layer they show `0`, for example files accessed via `rclone mount` or `rclone serve`. Commands like `rclone sync` and `rclone copy` simply ignore the document size and operate on them anyway. In other words, until you download a Google Doc you don't know how big it is, which makes little practical difference ...

@@ -0,0 +1,7 @@

# 6. Repost: Changing the update channel of Office 365 for PC

Office 365 for PC defaults to the semi-annual update channel; you can switch it to the monthly channel (or another one) to try the newest features.

> Original post: [https://www.mr-technos.com/forum.php?mod=viewthread&tid=79](https://www.mr-technos.com/forum.php?mod=viewthread&tid=79)
@@ -0,0 +1,91 @@

# 7. Original: A solution for moving Baidu Pan files into gd/od

**Home:** [HomePage](https://telegra.ph/HomePage-01-03)<br/>[https://telegra.ph/Fuck-PanBaidu-02-19](https://telegra.ph/Fuck-PanBaidu-02-19) <br/>[https://graph.org/Fuck-PanBaidu-02-19](https://graph.org/Fuck-PanBaidu-02-19)

### 1. Install aria2

```
wget -N https://git.io/aria2.sh && chmod +x aria2.sh && bash aria2.sh
```

- Start: `/etc/init.d/aria2 start`
- Stop: `/etc/init.d/aria2 stop`
- Restart: `/etc/init.d/aria2 restart`
- Status: `/etc/init.d/aria2 status`
- Config file: `/root/.aria2/aria2.conf` (it contains Chinese comments, which some systems may not display)
- RPC token: randomly generated (changeable in the config file)
- Default download directory: `/root/Download`

### 2. Offline downloads to gd/od with aria2

1. Install rclone

```
curl https://rclone.org/install.sh | sudo bash
```

For rclone configuration see: [https://rclone.org/drive/](https://rclone.org/drive/)

2. Edit the script **/root/.aria2/autoupload.sh**:

```
name='Onedrive'              # the remote name you chose in rclone config
folder='/DRIVEX/Download'    # target folder in the drive; leave empty for the drive root
```

3. Edit the aria2 config file **/root/.aria2/aria2.conf** and enable the post-download hook:

```
# call rclone to upload (move) finished downloads to the drive
on-download-complete=/root/.aria2/autoupload.sh
```

4. Restart aria2

```
/root/aria2.sh            # choose option 6 to restart
# or run:
service aria2 restart
```

5. Use an aria2 web frontend to start downloads: [aria2.ml](http://aria2.ml/)

Fill in the connection details of the aria2 instance on your VPS
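Those details correspond to the RPC options in the aria2 config on the VPS, roughly these (illustrative; the token is whatever `rpc-secret` is set to in /root/.aria2/aria2.conf):

```
# /root/.aria2/aria2.conf, the RPC-related options
enable-rpc=true
rpc-listen-all=true
rpc-listen-port=6800
rpc-secret=YOUR_TOKEN
```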

Click "new", paste in the download links, and the download starts

The finished files are uploaded to gd/od automatically

### 3. Using a third-party Baidu Pan client

I recommend SpeedPan (速盘); unfortunately PanDownload doesn't expose aria2 settings

As shown in the screenshot, change the download directory. It can't be changed from the GUI: quit the program first and edit it in config.ini:

The download directory must match the aria2 setup on the remote server; for an aria2 installed as above, that's **/root/Download**

With that, your Baidu Pan files download straight into gd/od.

### 4. Screenshots

1. Downloading a file to the VPS with the AriaNG frontend, with the **autoupload.sh script turning gd into an offline downloader for movies**

2. Using SpeedPan's remote-aria2 feature to pull Baidu Pan files down to the VPS, where the **autoupload.sh script re-uploads them to gd automatically**
@@ -0,0 +1,15 @@

# Posts by ds19991999

1. [Original: Quick manual install of JupyterLab on Debian, with HTTPS](https://blog.csdn.net/ds19991999/article/details/88935996)
2. [Original: Fixing "Waiting for headers" and 404 errors on apt update on a fresh Alibaba Cloud Debian box](https://blog.csdn.net/ds19991999/article/details/88659452)
3. [Original: Deploying a Netlify CMS backend for a Jekyll blog](https://blog.csdn.net/ds19991999/article/details/88651187)
4. [Original: Applying for a Let's Encrypt wildcard certificate](https://blog.csdn.net/ds19991999/article/details/88553810)
5. [Original: Rclone notes](https://blog.csdn.net/ds19991999/article/details/88370053)
6. [Repost: Changing the update channel of Office 365 for PC](https://blog.csdn.net/ds19991999/article/details/87973325)
7. [Original: A solution for moving Baidu Pan files into gd/od](https://blog.csdn.net/ds19991999/article/details/87736377)
8. [Original: Mounting OneDrive via WebDav](https://blog.csdn.net/ds19991999/article/details/86506042)
9. [Original: SMS verification-code platforms](https://blog.csdn.net/ds19991999/article/details/86505762)
10. [Original: A custom friend-links sidebar for CSDN](https://blog.csdn.net/ds19991999/article/details/86505686)
11. [Original: Resource roundup](https://blog.csdn.net/ds19991999/article/details/85225611)
12. [Original: Mounting OneDrive as a local drive on Windows](https://blog.csdn.net/ds19991999/article/details/85008885)
13. [Original: Ubuntu daily-use notes](https://blog.csdn.net/ds19991999/article/details/83719417)
14. [Original: Fixing Ubuntu's network problems for good](https://blog.csdn.net/ds19991999/article/details/83715489)

Binary file not shown. (new image, 440 KiB)
Binary file not shown. (new image, 19 KiB)
Binary file not shown. (new image, 67 KiB)
Binary file not shown. (new image, 44 KiB)
Binary file not shown. (new image, 94 KiB)

cookie.txt
@@ -0,0 +1,13 @@

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: acw_tc=2760829715714827204377171e8e9dc3a79185500e46805511b2c277adf1fb; acw_sc__v3=5daaec608ce6c5ba1fab0c4137c00ecb0cd34525; uuid_tt_dd=10_2450623130-1571482720624-229726; dc_session_id=10_1571482720624.999633; acw_sc__v2=5daaec6067f5ec51b728d2bd7660bf7372ed8903; TY_SESSION_ID=c82ca68f-e408-4c15-b681-71da67f637c2; dc_tos=pzmbtt; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1571482722; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1571482722; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_2450623130-1571482720624-229726; c-login-auto=1; announcement=%257B%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblogdev.blog.csdn.net%252Farticle%252Fdetails%252F102605809%2522%252C%2522announcementCount%2522%253A1%252C%2522announcementExpire%2522%253A527116621%257D
Host: blog.csdn.net
Referer: https://blog.csdn.net/
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36

csdn.py
@@ -0,0 +1,208 @@

#!/usr/bin/env python
# coding: utf-8

import os, time, re
import requests
import threading
import logging
from bs4 import BeautifulSoup, Comment
from selenium import webdriver
from tomd import Tomd


def result_file(folder_name, file_name):
    folder = os.path.join(os.path.dirname(os.path.realpath(__file__)), "articles", folder_name)
    if not os.path.exists(folder):
        os.makedirs(folder)
        path = os.path.join(folder, file_name)
        file = open(path, "w")
        file.close()
    else:
        path = os.path.join(folder, file_name)
    return path


def get_headers(cookie_path: str):
    cookies = {}
    with open(cookie_path, "r", encoding="utf-8") as f:
        cookie_list = f.readlines()
        for line in cookie_list:
            # split on the first colon only, so header values containing ":" survive
            cookie = line.split(":", 1)
            cookies[cookie[0]] = str(cookie[1]).strip()
    return cookies


def delete_ele(soup: BeautifulSoup, tags: list):
    for ele in tags:
        for useless_tag in soup.select(ele):
            useless_tag.decompose()


def delete_ele_attr(soup: BeautifulSoup, attrs: list):
    for attr in attrs:
        for useless_attr in soup.find_all():
            del useless_attr[attr]


def delete_blank_ele(soup: BeautifulSoup, eles_except: list):
    for useless_attr in soup.find_all():
        try:
            if useless_attr.name not in eles_except and useless_attr.text == "":
                useless_attr.decompose()
        except Exception:
            pass


class TaskQueue(object):
    def __init__(self):
        self.VisitedList = []
        self.UnVisitedList = []

    def getVisitedList(self):
        return self.VisitedList

    def getUnVisitedList(self):
        return self.UnVisitedList

    def InsertVisitedList(self, url):
        if url not in self.VisitedList:
            self.VisitedList.append(url)

    def InsertUnVisitedList(self, url):
        if url not in self.UnVisitedList:
            self.UnVisitedList.append(url)

    def RemoveVisitedList(self, url):
        self.VisitedList.remove(url)

    def PopUnVisitedList(self, index=0):
        url = ""
        if index and self.UnVisitedList:
            url = self.UnVisitedList[index]
            del self.UnVisitedList[:index]
        elif self.UnVisitedList:
            url = self.UnVisitedList.pop()
        return url

    def getUnVisitedListLength(self):
        return len(self.UnVisitedList)


class Article(object):
    def __init__(self):
        self.options = webdriver.ChromeOptions()
        self.options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.options.add_argument('headless')
        self.browser = webdriver.Chrome(options=self.options)
        # global implicit wait
        self.browser.implicitly_wait(30)

    def get_content(self, url):
        self.browser.get(url)
        try:
            self.browser.find_element_by_xpath('//a[@class="btn-readmore"]').click()
        except Exception:
            pass
        content = self.browser.find_element_by_xpath('//div[@id="content_views"]').get_attribute("innerHTML")
        return content

    def get_md(self, url):
        """
        Convert the article to markdown.
        """
        content = self.get_content(url)
        soup = BeautifulSoup(content, 'lxml')
        # remove HTML comments
        for useless_tag in soup(text=lambda text: isinstance(text, Comment)):
            useless_tag.extract()
        # remove useless tags
        tags = ["svg", "ul", ".hljs-button.signin"]
        delete_ele(soup, tags)
        # remove tag attributes
        attrs = ["class", "name", "id", "onclick", "style", "data-token", "rel"]
        delete_ele_attr(soup, attrs)
        # remove empty tags
        eles_except = ["img", "br", "hr"]
        delete_blank_ele(soup, eles_except)
        # convert to markdown
        md = Tomd(str(soup)).markdown
        return md


class CSDN(object):
    def __init__(self, cookie_path):
        self.headers = get_headers(cookie_path)
        self.TaskQueue = TaskQueue()

    def get_articles(self, username: str):
        """Fetch article titles and links."""
        num = 0
        while True:
            num += 1
            url = u'https://blog.csdn.net/' + username + '/article/list/' + str(num)
            response = requests.get(url=url, headers=self.headers)
            html = response.text
            soup = BeautifulSoup(html, "html.parser")
            articles = soup.find_all('div', attrs={"class":"article-item-box csdn-tracking-statistics"})
            if len(articles) > 0:
                for article in articles:
                    article_title = article.a.text.strip().replace(' ',':')
                    article_href = article.a['href']
                    yield article_title, article_href
            else:
                break

    def write_articals(self, username: str):
        """Write the posts to local files."""
        print("[++] Crawling posts of {} ......".format(username))
        artical = Article()
        reademe_path = result_file(username, file_name="README.md")
        with open(reademe_path, 'w', encoding='utf-8') as reademe_file:
            i = 1
            readme_head = "# Posts by " + username + "\n"
            reademe_file.write(readme_head)
            for article_title, article_href in self.get_articles(username):
                print("[++++] {}. processing URL: {}".format(str(i), article_href))
                text = str(i) + '. [' + article_title + '](' + article_href + ')\n'
                reademe_file.write(text)
                file_name = str(i) + "." + re.sub(r'[\/::*?"<>|]', '-', article_title) + ".md"
                artical_path = result_file(folder_name=username, file_name=file_name)
                md_content = artical.get_md(article_href)
                md_head = "# " + str(i) + "." + article_title + "\n"
                md = md_head + md_content
                with open(artical_path, "w", encoding="utf-8") as artical_file:
                    artical_file.write(md)
                i += 1
                time.sleep(2)

    def spider(self):
        """Save the crawled articles locally."""
        # busy-waits on the task queue; runs until the process is killed
        while True:
            if self.TaskQueue.getUnVisitedListLength():
                username = self.TaskQueue.PopUnVisitedList()
                self.write_articals(username)

    def check_user(self, user_path: str):
        with open(user_path, 'r', encoding='utf-8') as f:
            users = f.readlines()
        for user in users:
            self.TaskQueue.InsertUnVisitedList(user.strip())

    def run(self, user_path):
        UserThread = threading.Thread(target=self.check_user, args=(user_path,))
        SpiderThread = threading.Thread(target=self.spider, args=())
        UserThread.start()
        SpiderThread.start()
        UserThread.join()
        SpiderThread.join()


def main():
    user_path = 'username.txt'
    csdn = CSDN('cookie.txt')
    csdn.run(user_path)


if __name__ == "__main__":
    main()

requirements.txt
@@ -0,0 +1,4 @@

bs4==0.0.1
selenium==3.141.0
requests==2.22.0
lxml

tomd.py
@@ -0,0 +1,155 @@

import re

__all__ = ['Tomd', 'convert']

MARKDOWN = {
    'h1': ('\n# ', '\n'),
    'h2': ('\n## ', '\n'),
    'h3': ('\n### ', '\n'),
    'h4': ('\n#### ', '\n'),
    'h5': ('\n##### ', '\n'),
    'h6': ('\n###### ', '\n'),
    'code': ('`', '`'),
    'ul': ('', ''),
    'ol': ('', ''),
    'li': ('- ', ''),
    'blockquote': ('\n> ', '\n'),
    'em': ('**', '**'),
    'strong': ('**', '**'),
    'block_code': ('\n```\n', '\n```\n'),
    'span': ('', ''),
    'p': ('\n', '\n'),
    'p_with_out_class': ('\n', '\n'),
    'inline_p': ('', ''),
    'inline_p_with_out_class': ('', ''),
    'b': ('**', '**'),
    'i': ('*', '*'),
    'del': ('~~', '~~'),
    'hr': ('\n---', '\n\n'),
    'thead': ('\n', '|------\n'),
    'tbody': ('\n', '\n'),
    'td': ('|', ''),
    'th': ('|', ''),
    'tr': ('', '\n')
}

BlOCK_ELEMENTS = {
    'h1': '<h1.*?>(.*?)</h1>',
    'h2': '<h2.*?>(.*?)</h2>',
    'h3': '<h3.*?>(.*?)</h3>',
    'h4': '<h4.*?>(.*?)</h4>',
    'h5': '<h5.*?>(.*?)</h5>',
    'h6': '<h6.*?>(.*?)</h6>',
    'hr': '<hr/>',
    'blockquote': '<blockquote.*?>(.*?)</blockquote>',
    'ul': '<ul.*?>(.*?)</ul>',
    'ol': '<ol.*?>(.*?)</ol>',
    'block_code': '<pre.*?><code.*?>(.*?)</code></pre>',
    'p': '<p\s.*?>(.*?)</p>',
    'p_with_out_class': '<p>(.*?)</p>',
    'thead': '<thead.*?>(.*?)</thead>',
    'tr': '<tr>(.*?)</tr>'
}

INLINE_ELEMENTS = {
    'td': '<td>(.*?)</td>',
    'tr': '<tr>(.*?)</tr>',
    'th': '<th>(.*?)</th>',
    'b': '<b>(.*?)</b>',
    'i': '<i>(.*?)</i>',
    'del': '<del>(.*?)</del>',
    'inline_p': '<p\s.*?>(.*?)</p>',
    'inline_p_with_out_class': '<p>(.*?)</p>',
    'code': '<code.*?>(.*?)</code>',
    'span': '<span.*?>(.*?)</span>',
    'ul': '<ul.*?>(.*?)</ul>',
    'ol': '<ol.*?>(.*?)</ol>',
    'li': '<li.*?>(.*?)</li>',
    'img': '<img.*?src="(.*?)".*?>(.*?)</img>',
    'a': '<a.*?href="(.*?)".*?>(.*?)</a>',
    'em': '<em.*?>(.*?)</em>',
    'strong': '<strong.*?>(.*?)</strong>'
}

DELETE_ELEMENTS = ['<span.*?>', '</span>', '<div.*?>', '</div>']


class Element:
    def __init__(self, start_pos, end_pos, content, tag, is_block=False):
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.content = content
        self._elements = []
        self.is_block = is_block
        self.tag = tag
        self._result = None

        if self.is_block:
            self.parse_inline()

    def __str__(self):
        wrapper = MARKDOWN.get(self.tag)
        self._result = '{}{}{}'.format(wrapper[0], self.content, wrapper[1])
        return self._result

    def parse_inline(self):
        for tag, pattern in INLINE_ELEMENTS.items():
            if tag == 'a':
                self.content = re.sub(pattern, '[\g<2>](\g<1>)', self.content)
            elif tag == 'img':
                self.content = re.sub(pattern, '![\g<2>](\g<1>)', self.content)
            elif self.tag == 'ul' and tag == 'li':
                self.content = re.sub(pattern, '- \g<1>', self.content)
            elif self.tag == 'ol' and tag == 'li':
                self.content = re.sub(pattern, '1. \g<1>', self.content)
            elif self.tag == 'thead' and tag == 'tr':
                self.content = re.sub(pattern, '\g<1>\n', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'th':
                self.content = re.sub(pattern, '|\g<1>', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'td':
                self.content = re.sub(pattern, '|\g<1>', self.content.replace('\n', ''))
            else:
                wrapper = MARKDOWN.get(tag)
                self.content = re.sub(pattern, '{}\g<1>{}'.format(wrapper[0], wrapper[1]), self.content)


class Tomd:
    def __init__(self, html='', options=None):
        self.html = html
        self.options = options
        self._markdown = ''

    def convert(self, html, options=None):
        elements = []
        for tag, pattern in BlOCK_ELEMENTS.items():
            for m in re.finditer(pattern, html, re.I | re.S | re.M):
                element = Element(start_pos=m.start(),
                                  end_pos=m.end(),
                                  content=''.join(m.groups()),
                                  tag=tag,
                                  is_block=True)
                can_append = True
                for e in elements:
                    if e.start_pos < m.start() and e.end_pos > m.end():
                        can_append = False
                    elif e.start_pos > m.start() and e.end_pos < m.end():
                        elements.remove(e)
                if can_append:
                    elements.append(element)

        elements.sort(key=lambda element: element.start_pos)
        self._markdown = ''.join([str(e) for e in elements])

        for index, element in enumerate(DELETE_ELEMENTS):
            self._markdown = re.sub(element, '', self._markdown)
        return self._markdown

    @property
    def markdown(self):
        self.convert(self.html, self.options)
        return self._markdown


_inst = Tomd()
convert = _inst.convert

username.txt
@@ -0,0 +1 @@

ds19991999