# CSDN Crawler Script

Main function: crawl all blog posts of the specified users on `csdn`, convert them to `markdown`, and save them locally.

## I. Runtime Environment

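A `WebDriver` driver is required. Download the `chrome` driver that matches your local Chrome version from https://chromedriver.chromium.org/downloads and add it to the `$PATH` environment variable.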

```shell
python3 --version  # Python 3 is required
python3 -m pip install -r requirements.txt
```
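
Before running the crawler, it can help to confirm that the driver is actually reachable through `$PATH`. The following is a minimal sketch using only the standard library; `find_driver` is a hypothetical helper name and is not part of the script, and the executable name `chromedriver` is assumed.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver not found on PATH; download it and add its folder to PATH")
    else:
        print("chromedriver found at", path)
```

If the function returns `None`, the driver folder is probably missing from `$PATH`.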

## II. Getting the Script

```shell
git clone https://github.com/ds19991999/csdn-spider.git
```

## III. Usage

### 1. Get the cookie

Log in to your `csdn` account, open https://blog.csdn.net , press `F12` to open the browser's developer tools, copy all of the `Request Headers`, and save them to the `cookie.txt` file.

![1571482112632](assets/1571482112632.png)
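
The copied `Request Headers` are plain `Name: value` lines. As a rough illustration of how such a file could be turned into a headers dictionary, here is a minimal sketch. `parse_headers` is a hypothetical helper and the script's own parsing may differ. It splits on the first colon only, so values that themselves contain colons (such as a `Referer` URL) stay intact.

```python
def parse_headers(text):
    """Parse 'Name: value' lines into a dict, splitting on the first colon only."""
    headers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


# Two sample header lines in the same shape as the copied Request Headers
sample = "Host: blog.csdn.net\nReferer: https://blog.csdn.net/\n"
print(parse_headers(sample))
```

In practice the text would come from reading `cookie.txt` with `encoding="utf-8"`.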

### 2. Add the `csdn` users to crawl

Add the usernames to `username.txt`, one per line.
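
To illustrate the expected file shape, the sketch below reads a one-name-per-line list while ignoring blank lines and repeated names. `read_usernames` is a hypothetical helper written for this example; the skipping of blanks and duplicates is an assumption about sensible behavior, not a documented guarantee of the script.

```python
def read_usernames(text):
    """Return stripped, non-empty usernames in input order, without duplicates."""
    users = []
    for line in text.splitlines():
        name = line.strip()
        if name and name not in users:
            users.append(name)
    return users


# Example content of a username file: one name per line, with a stray blank line
print(read_usernames("ds19991999\n\nds19991999\n"))
```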

### 3. Run the script

```shell
python3 csdn.py
```

## IV. Results

**Running the script**

![1571483423256](assets/1571483423256.png)

**Generated article index**: `./articles/username/README.md`

![1571483552438](assets/1571483552438.png)

**Crawled posts**: `./articles/username/`

![1571483479356](assets/1571483479356.png)
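
Each post is saved under a numbered file name derived from its title, with characters that are unsafe in file names replaced. The sketch below is modelled on the naming visible in the example output. `post_filename` is a hypothetical helper, and the exact character set replaced here (including the backslash) is an assumption.

```python
import re


def post_filename(index, title):
    """Build '<index>.<title>.md', replacing characters unsafe in file names with '-'."""
    safe_title = re.sub(r'[\\/::*?"<>|]', "-", title)
    return "{}.{}.md".format(index, safe_title)


print(post_filename(7, "原创:转存百度盘到gd/od的解决方案"))
```

Both the full-width colon and the slash in the title are replaced with `-`, so the result is a valid file name on common file systems.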

**Converted post**:

![1571483777703](assets/1571483777703.png)

## V. LICENSE

Creative Commons License

`PS`: This is a casually written crawler script; updates happen only occasionally.
+ + +``` +sudo apt-get install software-properties-common + +``` + +## 安装`Python`环境 + +``` +sudo apt-get install python-pip python-dev build-essential +sudo pip install --upgrade pip +sudo pip install --upgrade virtualenv +sudo apt-get install python-setuptools python-dev build-essential +sudo easy_install pip +sudo pip install --upgrade virtualenv +sudo apt-get install python3-pip +sudo apt-get install python-pip +sudo pip3 install --upgrade pip +sudo pip2 install --upgrade pip +sudo pip install --upgrade pip + +``` + +## 查看`pip`指向 + +``` +~ $which pip +/usr/local/bin/pip +21:36 alien@alien-Inspiron-3443: +~ $which pip2 +/usr/local/bin/pip2 +21:36 alien@alien-Inspiron-3443: +~ $which pip3 +/usr/local/bin/pip3 + +``` + +## 安装`yarn` + +``` +curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add - +echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list +sudo apt-get update +sudo apt-get install yarn + +``` + +## 安装`nodejs` + +``` +curl -sL https://deb.nodesource.com/setup_10.x | bash - +apt-get install -y nodejs + +``` + +## 安装`jupyterlab` + +``` +sudo pip2 install jupyterlab + +``` + +## 配置`jupyerlab` + +``` +jupyter-notebook password + +``` + +进入`ipython`设置哈希密码,这里输入的是你登陆`jupyter lab`的密码,记下生成的哈希密码. 
+ +``` +ipython +from notebook.auth import passwd +passwd() +# 输入你自己设置登录JupyterLab界面的密码, +# 然后就会生产下面这样的密码,将它记下来,待会儿用 +'sha1:b92f3fb7d848:a5d40ab2e26aa3b296ae1faa17aa34d3df351704' + +``` + +## 编辑配置文件 + +一般在`/root/.jupyter/jupyter_notebook_config.py`中,找到并修改以下配置项。 + +``` +c.NotebookApp.allow_root = True +c.NotebookApp.ip = '' +c.NotebookApp.notebook_dir = u'/root/JupyterLab' +c.NotebookApp.open_browser = False +c.NotebookApp.password = u'sha1:b92f3fb7d848:a5d40ab2e26aa3b296ae1faa17aa34d3df351704' +c.NotebookApp.port = 8888 + +# 解释以上各项 +允许以root方式运行jupyterlab +允许任意ip段访问 +设置jupyterlab页面的根目录 +默认运行时不启动浏览器,因为服务器默认只有终端嘛 +设置之前生产的哈希密码 +设置访问端口,与下面的caddy需一致 + +``` + +## 运行`Jupyter Lab` + +``` +jupyter-lab --version +jupyter lab build + +mkdir ~/JupyterLab +cd ~/JupyterLab + +# 方便后台运行 +apt install screen +screen -S jupterlab +jupyter lab + +``` + +`ctrl+A+D`退出这个窗口。 + +## `caddy`开启`https`反代 + +域名改成你自己的,`caddy`详细使用见:[`【传送门】`](https://www.creat.kim/archives/18/) + +``` +wget -N --no-check-certificate https://raw.githubusercontent.com/ds19991999/shell.sh/shell/caddy_install.sh && chmod +x caddy_install.sh && bash caddy_install.sh + +echo "jupyter.creat.kim + gzip + tls cva.engineer.ding@gmail.com + proxy / { + transparent + websocket + }" > /usr/local/caddy/Caddyfile + +``` + +## 定时备份到`GitHub` + +见大佬写的比较详细的文章:[`【传送门】`](https://www.moerats.com/archives/858/) + +## 配置`python2`和`python3`内核 + +好人做到底吧,这里肯定很多人踩坑。。。用`pip`安装包的时候千万不要用`pip3 install ***`或者`pip2 install ***`呀. 
+ +``` +python2 -m pip install ipykernel ipython matplotlib scipy pandas numpy +python3 -m pip install ipykernel ipython matplotlib scipy pandas numpy + +``` + +检查一下内核 + +``` +root@google:~/JupyterLab# jupyter kernelspec list +Available kernels: + python2 /usr/local/share/jupyter/kernels/python2 + python3 /usr/local/share/jupyter/kernels/python3 + +``` + +好了,访问域名,开始使用吧。 + +--- + + +## 最后一点思悟 + +大概这是我发在`CSDN`最后的博文了,本文来自 [https://www.creat.kim/archives/40/](https://www.creat.kim/archives/40/) ,不错,终于抛弃公共博客平台了。我在`CSDN`写了差不多一年半左右的博文吧,共`107`篇,其中`97`篇原(chao)创(xi),`7`篇转载,`2`篇私密,`1`篇因违反相关政策被管理员设为私密 … 博客`CSDN`排名`10k+`,访问量`225k+`,粉丝数`48`,表现平平,博文水平一般,算是代表了大部分人吧。 + +国内的博客平台其实都不错,`CSDN` 的写作体验也非常好,我曾经也一度在自己的博客平台或者公共博客平台之间徘徊,慢慢的最初写博客的意义就变味了,不过经历过这个过程,大概就明白了一些事吧。 + +在尝试`WordPress` 、`知乎` 、`简书`、`博客园`、`新浪`、`GitHub-Jekyll` 、`coding-jekyll`、`hexo` 、`Typecho`…之后,了解了一些网站运行常识,最起码知道国内的都是需要备案的 …
+在图床方面,从最初的直接复制粘贴到`GitHub`+`PicGo`、`又拍云` (需要备案)、`七牛云`(需要备案)、自建图床…明白了一些`CDN`加速技巧 …
+在文档方面,从最初的直接编辑,到`CSDN`的`MarkDown`编辑器、`有道云笔记`、`Evernote`(分国外国内版本)、`GitHub-README`、`GitBook`、`MkDoc`、`Read the Docs`、`Sphinx`、`Docsify`,明白了孰能生巧,熟练的话,什么文本都能写的漂亮,虽然我至今不会`Vim` …
+在服务器选择上面,国内和国外的差异,也了解了不少,也越来越深恶痛绝 `install` 一个包或者一个`程序`的时候,你就那么几`k`几`b`的跑,国内源再怎么换,也比不上国外源的速度,有些网站虽然没有被`q`,你本地那速度受的了吗,现在也服气当初我是怎么忍受那龟一般的网速。看到过,了解过,才能从另一个角度看待问题,总比一直看被经过过滤的信息强吧。 + +再看看国外的教育福利,有人说是国外被中国人撸羊毛撸怕了,所以就不给中国提供教育福利。但是你看看国内大厂的教育福利,那服务器多便宜,我自己都眼馋,赶紧去每个厂注册一个号。要求实名,好,我实名,我传照片;要求备案,啥,还备案,好,我备案,我传照片,又是一个星期;这咋还有监测呢,忍不了了 … 这像不像裸贷,你只要用身份证实名,把自己的靓照交给他,他就给你提供廉价的服务器,这里说的有点过了,哈哈哈。前不久谷歌也要求中国IP注册地需要传照片了,唯独中国。国外在教育方面的投资我们真的要好好学习学习 … + +之前的`12306事件`、`蓝灯事件`、`某某数据库泄露`,真真假假假亦真。身在国内,就不得不用隐私换取便利。 diff --git a/articles/ds19991999/2.原创-解决套路云Debian新机update的时候出现Waiting for headers和404错误.md b/articles/ds19991999/2.原创-解决套路云Debian新机update的时候出现Waiting for headers和404错误.md new file mode 100644 index 0000000..72158bb --- /dev/null +++ b/articles/ds19991999/2.原创-解决套路云Debian新机update的时候出现Waiting for headers和404错误.md @@ -0,0 +1,33 @@ +# 2.原创:解决套路云Debian新机update的时候出现Waiting for headers和404错误 + +``` +rm -rf /root/.pip /root/.pydistutils.cfg /etc/apt/sources.list.d/sources-aliyun-0.list /etc/apt/sources.list.d/sources-aliyun* /var/lib/apt/lists/* + +``` + +``` +deb http://mirrors.cloud.aliyuncs.com/debian/ jessie main contrib non-free +deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie main contrib non-free +deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-proposed-updates main non-free contrib +deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-proposed-updates main non-free contrib +deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-updates main contrib non-free +deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-updates main contrib non-free + +## Uncomment the following two lines to add software from the 'backports' +## repository. +## +## N.B. software from this repository may not have been tested as +## extensively as that contained in the main release, although it includes +## newer versions of some applications which may provide useful features. 
+#deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-backports main contrib non-free +#deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-backports main contrib non-free + +``` + +``` +apt-get clean +apt-get update + +``` + +套路云还是套路云,服气!!! diff --git a/articles/ds19991999/3.原创-Jekyll 博客 Netlify CMS 后台部署.md b/articles/ds19991999/3.原创-Jekyll 博客 Netlify CMS 后台部署.md new file mode 100644 index 0000000..2c669c3 --- /dev/null +++ b/articles/ds19991999/3.原创-Jekyll 博客 Netlify CMS 后台部署.md @@ -0,0 +1,3 @@ +# 3.原创:Jekyll 博客 Netlify CMS 后台部署 + +### 文章目录 diff --git a/articles/ds19991999/4.原创-Let's Encrypt 泛域名证书申请.md b/articles/ds19991999/4.原创-Let's Encrypt 泛域名证书申请.md new file mode 100644 index 0000000..88e7bcd --- /dev/null +++ b/articles/ds19991999/4.原创-Let's Encrypt 泛域名证书申请.md @@ -0,0 +1,71 @@ +# 4.原创:Let's Encrypt 泛域名证书申请 + +> +github: [https://github.com/Neilpang/acme.sh](https://github.com/Neilpang/acme.sh) + + +通过acme申请Let’s Encrypt证书支持的域名DNS服务商有以下这些(国内用户较多的):`cloudxns、dnspod、aliyun(阿里云)、cloudflare、linode、he、digitalocean、namesilo、aws、namecom、freedns、godaddy、yandex` 等等。 + +### 目录 + +## [安装acm.sh](http://xn--acm-pd0fq01r.sh) + +``` +curl https://get.acme.sh | sh + +``` + +`acme.sh`被安装在了`~./.acme.sh`,创建 一个 `bash` 的 `alias`, 方便你的使用: `alias acme.sh=~/.acme.sh/acme.sh` + +通过`acme.sh`安装的证书会自动为你创建 `cronjob`, 每天 0:00 点自动检测所有的证书, 如果快过期了, 需要更新, 则会自动更新证书. + +## DNS方式验证域名所有权 + +``` +acme.sh --issue --dns -d mydomain.com + +``` + +`acme.sh` 会生成相应的解析记录显示出来, 你只需要在你的域名管理面板中添加这条 `txt` 记录即可. 
+ +## 获取`DNS API` + +获取`DNS`域名商的`DNS API` ,`api` 也会将 上面的`txt` 记录自动添加到域名解析商。比喻阿里的`api`:[https://ak-console.aliyun.com/#/accesskey](https://ak-console.aliyun.com/#/accesskey) ,然后看说明进行配置 [https://github.com/Neilpang/acme.sh/tree/master/dnsapi](https://github.com/Neilpang/acme.sh/tree/master/dnsapi) 阿里的就是: + +``` +export Ali_Key="sdfsdfsdfljlbjkljlkjsdfoiwje" +export Ali_Secret="jlsdflanljkljlfdsaklkjflsa" +acme.sh --issue --dns dns_ali -d example.com -d *.example.com + +``` + +这个`*`值的就是泛域名。运行一次之后Ali_Key和Ali_Secret将被保存`~/.acme.sh/account.conf`,生成的SSL证书目录在`~/.acme.sh/example.com` + +## 安装证书 + +> +详见:[copy/安装 证书](https://github.com/Neilpang/acme.sh/wiki/%E8%AF%B4%E6%98%8E#3-copy%E5%AE%89%E8%A3%85-%E8%AF%81%E4%B9%A6) + + +使用 `--installcert` 命令,并指定目标位置, 然后证书文件会被copy到相应的位置, 例如: + +``` +acme.sh --installcert -d <domain>.com \ + --key-file /etc/nginx/ssl/<domain>.key \ + --fullchain-file /etc/nginx/ssl/fullchain.cer \ + --reloadcmd "service nginx force-reload" + +``` + +宝塔用户在SSL选项选择其他证书,把SSL证书内容粘贴上面去就行了
+目前证书在 60 天以后会自动更新, 你无需任何操作. 今后有可能会缩短这个时间, 不过都是自动的, 你不用关心. + +## 更新 `acme.sh` + +自动更新:`acme.sh --upgrade --auto-upgrade`
+关闭更新:`acme.sh --upgrade --auto-upgrade 0` + +有问题看 [wiki](https://github.com/Neilpang/acme.sh/wiki) 和 [dubug](https://github.com/Neilpang/acme.sh/wiki/How-to-debug-acme.sh) diff --git a/articles/ds19991999/5.原创-Rclone笔记.md b/articles/ds19991999/5.原创-Rclone笔记.md new file mode 100644 index 0000000..081127e --- /dev/null +++ b/articles/ds19991999/5.原创-Rclone笔记.md @@ -0,0 +1,181 @@ +# 5.原创:Rclone笔记 + +> + + + +### 目录 + +## 一些简单命令 + +### 挂载 + +``` +# windows 挂载命令 +rclone mount OD:/ H: --cache-dir E:\ODPATH --vfs-cache-mode writes & +# linux 挂载命令 +nohup rclone mount GD:/ /root/GDPATH --copy-links --no-gzip-encoding --no-check-certificate --allow-other --allow-non-empty --umask 000 & +# 取消挂载————linux 通用 +fusermount -qzu /root/GDPATH 或者 +fusermount -u /path/to/local/mount +# windows 取消挂载 +umount /path/to/local/mount + +``` + +### rclone命令 + +``` +rclone ls + +eg____rclone ls remote:path [flags] +ls # 递归列出 remote 所有文件及其大小,有点类似 tree 命令 +lsl # 递归列出 remote 所有文件、大小及修改时间 +lsd # 仅仅列出文件夹的修改时间和文件夹内的文件个数 + +lsf # 列出当前层级的文件或文件夹名称 +lsjson # 以JSON格式列出文件和目录 + + +rclone copy + +eg____rclone copy OD:/SOME/PATH GD:/OTHER/PATH +--no-traverse # /path/to/src中有很多文件,但每天只有少数文件发生变化,加上这个参数可以提高传输速度 +-P # 实时查看传输统计信息 +--max-age 24h # 仅仅传输24小时内修改过的文件,默认关闭 +rclone copy --max-age 24h --no-traverse /path/to/src remote:/PATH -P + +rclone sync +eg____rclone sync source:path dest:path [flags] +# 使用该命令时先用 --dry-run 测试,明确要复制和删除的内容 + +rclone delete +# 列出大于 100M 的文件 +rclone --min-size 100M lsl remote:path +# 删除测试 +rclone --dry-run --min-size 100M delete remote:path +# 删除 +rclone --min-size 100M delete remote:path + +# 删除路径及其所有内容,filters此时无效,这与 delete 不同 +rclone purge + +# 删除空路径 +rclone rmdir + +# 删除路径下的空目录 +rclone rmdirs + +# 移动文件 +rclone move +# 移动后删除空源目录 +--delete-empty-src-dirs + +# 检查源和目标匹配中的文件 +rclone check +# 从两个源下载数据并在运行中互相检查它们而不是哈希检测 +--download + +rclone md5sum +# 为路径中的所有文件生成md5sum文件 +rclone sha1sum +# 为路径中的所有文件生成sha1sum文件 +rclone size +# 在remote:path中打印该路径下的文件总大小和数量 +--json # 输出json格式 +rclone version 
--check #检查版本更新 +rclone cleanup # 清理源的垃圾箱或者旧版本文件 + +rclone dedupe # 以交互方式查找重复文件并删除/重命名它们 +--dedupe-mode newest - 删除相同的文件,然后保留最新的文件,非交互方式 + +rclone cat +# 同linux + +rclone copyto +# 将文件从源复制到dest,跳过已复制的文件 + +rclone gendocs output_directory [flags] +# 生成rclone的说明文档 + +rclone listremotes # 列出配置文件中所有源 +--long 显示类型和名称 默认只显示名称 + +rclone moveto +# 不会传输未更改的文件 + +rclone cryptcheck /path/to/files encryptedremote:path +# 检查加密源的完整性 + +rclone about +# 获取源的配额 ,eg +$ rclone about ODA1P1: +Total: 5T +Used: 284.885G +Free: 4.668T +Trashed: 43.141G +--json # 以 json 格式输出 + + +rclone mount # 挂载命令 + +# 在Windows使用则需要安装winfsp +--vfs-cache-mode # 不使用该参数,只能按顺序写入文件,只能在读取时查找,即windows程序无法操作文件,使用该参数即启用缓存机制 +# 共四种模式:off|minimal|writes|full 缓存模式越高,rclone越多,代价是使用磁盘空间,默认为full +--vfs-cache-max-age 24h # 缓存24小时内修改过的文件 +--vfs-cache-max-size 10g # 最大总缓存10g (缓存可能会超过此大小) +--cache-dir 指定缓存位置 +--umask int 覆盖文件系统权限 +--allow-non-empty 允许挂载在非空目录 +--allow-other 允许其他用户访问 +--no-check-certificate 不检查服务器SSL证书 +--no-gzip-encoding 不设置接受gzip编码 + +``` + +## 用自己的 api 进行 gd 转存 + +> +见这位大佬博客:[https://www.moerats.com/archives/877/](https://www.moerats.com/archives/877/) + + +使用 `rclone` 的人太多吉会有一个问题,我们使用的是共享的`client_id`,在高峰期会出现`403`或者还没到`750G`限制就出现`Limitations`问题,所以高频率使用`rclone`转存谷歌文件得朋友就需要使用自己的`api`。通过上面那篇文章给出的方法获取谷歌的 API 客户端`ID`和客户端密钥,`rclone config`命令配置的时候,会有部分提示你输入,直接粘贴就`OK`. 
+ +挂载就变成: + +``` +#该参数主要是上传用的 +/usr/bin/rclone mount DriveName:Folder LocalFolder \ + --umask 0000 \ + --default-permissions \ + --allow-non-empty \ + --allow-other \ + --transfers 4 \ + --buffer-size 32M \ + --low-level-retries 200 + +#如果你还涉及到读取使用,比如使用H5ai等在线播放,就还建议加3个参数,添加格式参考上面 +--dir-cache-time 12h +--vfs-read-chunk-size 32M +--vfs-read-chunk-size-limit 1G + +``` + +## 突破 Google Drive 服务端 750g 限制 + +谷歌官方明确限制通过第三方`api`每天限制转存`750G`文件,这个 `750G` 是直接通过谷歌服务端进行,文件没有经过客户端,另外经过客户端上传到 `gd` 与 服务端转存不冲突,官方也有 `750G` 限制,所以每天上传限额一共是 `1.5T` + +``` +# 一般用法,使用服务端API,不消耗本地流量 +rclone copy GD1:/PATH GD2:/PATH + +# disable server side copies 使用客户端 API,流量走客户端 +rclone --disable copy GD1:/PATH GD2:/PATH + +``` + +这样就是每天 `1.5T` 了。 + +## 谷歌文档限制 + +在 `rclone ls` 中谷歌文档会出现 `-1`, 而对于其他 `VFS` 层文件显示 `0` ,比喻通过 `rclone mount`,`rclone serve`操作的文件。而我们用 `rclone sync`,`rclone copy`的命令时,它会忽略文档大小而直接操作。也就是说如果你没有下载谷歌文档,就不知道它多大,没啥影响… diff --git a/articles/ds19991999/6.转载-Office365 PC版修改更新频道.md b/articles/ds19991999/6.转载-Office365 PC版修改更新频道.md new file mode 100644 index 0000000..9f23062 --- /dev/null +++ b/articles/ds19991999/6.转载-Office365 PC版修改更新频道.md @@ -0,0 +1,7 @@ +# 6.转载:Office365 PC版修改更新频道 + +Office 365 PC版 默认为半年更新频道,可以修改为每月更新频道或其他频道,以体验最新功能。 + +> +原文链接:[https://www.mr-technos.com/forum.php?mod=viewthread&tid=79](https://www.mr-technos.com/forum.php?mod=viewthread&tid=79) + diff --git a/articles/ds19991999/7.原创-转存百度盘到gd-od的解决方案.md b/articles/ds19991999/7.原创-转存百度盘到gd-od的解决方案.md new file mode 100644 index 0000000..98bce8c --- /dev/null +++ b/articles/ds19991999/7.原创-转存百度盘到gd-od的解决方案.md @@ -0,0 +1,91 @@ +# 7.原创:转存百度盘到gd/od的解决方案 + +**首页:**[HomePage](https://telegra.ph/HomePage-01-03)
[https://graph.org/Fuck-PanBaidu-02-19](https://graph.org/Fuck-PanBaidu-02-19) + +### 一、安装aria2 + +``` +wget -N https://git.io/aria2.sh && chmod +x aria2.sh && bash aria2.sh + +``` + +启动:/etc/init.d/aria2 start + +停止:/etc/init.d/aria2 stop + +重启:/etc/init.d/aria2 restart + +查看状态:/etc/init.d/aria2 status + +配置文件:/root/.aria2/aria2.conf (配置文件包含中文注释,但是一些系统可能不支持显示中文) + +令牌密匙:随机生成(可以改配置文件) + +默认下载目录:/root/Download + +### 二、aria2离线gd/od方案 + +1、安装rclone + +``` +curl https://rclone.org/install.sh | sudo bash + +``` + +rclone配置可以参考:[https://rclone.org/drive/](https://rclone.org/drive/) + +2、修改脚本 **/root/.aria2/autoupload.sh** + +``` +- name='Onedrive' #配置Rclone时的name- folder='/DRIVEX/Download' #网盘里的文件夹,留空为网盘根目录。 +``` + +3、修改aria2配置文件:**/root/.aria2/aria2.conf 启用文件下载完成后脚本:** + +``` +- # 调用 rclone 上传(move)到网盘- on-download-complete=/root/.aria2/autoupload.sh +``` + +4、重启 aria2 + +``` +- /root/aria2.sh 选6重启- 或者运行:service aria2 restart +``` + +5、使用aria2前端面板进行文件下载:[aria2.ml](http://aria2.ml/) + +填好vps端的aria2配置信息 + +  + +点击新建粘贴下载链接进行文件下载 + +  + +下载的文件会自动上传到gd/od + +### 三、利用第三方百度盘 + +这里推荐速盘,可惜PanDownload没有开放aria2配置 + +  + +如图,修改下载文件保存位置,GUI界面无法修改,请先退出软件,在config.ini文件中进行修改: + +  + +  + +其中下载文件保存位置与远程服务器的aria2的配置一样,比喻此方式安装的aria2就是**/root/Download** + +于是就可以把你的百度网盘文件直接下载到gd/od中了。 + +### 四、效果图 + +1.使用AriaNG面板下载文件到VPS,利用**autoupload.sh脚本实现gd离线下载电影** + +  + +2.利用速盘远程aria2的功能实现将百度网盘文件远程下载到VPS,再利用**autoupload.sh脚本实现自动转存到gd** + +  diff --git a/articles/ds19991999/README.md b/articles/ds19991999/README.md new file mode 100644 index 0000000..b33f21c --- /dev/null +++ b/articles/ds19991999/README.md @@ -0,0 +1,15 @@ +# ds19991999 的博文 +1. [原创:Debian快速手动安装JupyterLab并配置Https](https://blog.csdn.net/ds19991999/article/details/88935996) +2. [原创:解决套路云Debian新机update的时候出现Waiting for headers和404错误](https://blog.csdn.net/ds19991999/article/details/88659452) +3. [原创:Jekyll 博客 Netlify CMS 后台部署](https://blog.csdn.net/ds19991999/article/details/88651187) +4. 
[原创:Let's Encrypt 泛域名证书申请](https://blog.csdn.net/ds19991999/article/details/88553810) +5. [原创:Rclone笔记](https://blog.csdn.net/ds19991999/article/details/88370053) +6. [转载:Office365 PC版修改更新频道](https://blog.csdn.net/ds19991999/article/details/87973325) +7. [原创:转存百度盘到gd/od的解决方案](https://blog.csdn.net/ds19991999/article/details/87736377) +8. [原创:以WebDav方式挂载OneDrive](https://blog.csdn.net/ds19991999/article/details/86506042) +9. [原创:接码平台分享](https://blog.csdn.net/ds19991999/article/details/86505762) +10. [原创:CSDN自定义友链侧边栏](https://blog.csdn.net/ds19991999/article/details/86505686) +11. [原创:资源分享](https://blog.csdn.net/ds19991999/article/details/85225611) +12. [原创:Windows上挂载OneDrive为本地硬盘](https://blog.csdn.net/ds19991999/article/details/85008885) +13. [原创:Ubuntu使用日常](https://blog.csdn.net/ds19991999/article/details/83719417) +14. [原创:彻底解决Ubuntu联网问题——网速飞起](https://blog.csdn.net/ds19991999/article/details/83715489) diff --git a/assets/1571482112632.png b/assets/1571482112632.png new file mode 100644 index 0000000..4312ded Binary files /dev/null and b/assets/1571482112632.png differ diff --git a/assets/1571483423256.png b/assets/1571483423256.png new file mode 100644 index 0000000..f03b13b Binary files /dev/null and b/assets/1571483423256.png differ diff --git a/assets/1571483479356.png b/assets/1571483479356.png new file mode 100644 index 0000000..e8875aa Binary files /dev/null and b/assets/1571483479356.png differ diff --git a/assets/1571483552438.png b/assets/1571483552438.png new file mode 100644 index 0000000..c59c2af Binary files /dev/null and b/assets/1571483552438.png differ diff --git a/assets/1571483777703.png b/assets/1571483777703.png new file mode 100644 index 0000000..bc4b27f Binary files /dev/null and b/assets/1571483777703.png differ diff --git a/cookie.txt b/cookie.txt new file mode 100644 index 0000000..75e155e --- /dev/null +++ b/cookie.txt @@ -0,0 +1,13 @@ +Accept: 
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3 +Accept-Encoding: gzip, deflate, br +Accept-Language: zh-CN,zh;q=0.9 +Cache-Control: max-age=0 +Connection: keep-alive +Cookie: acw_tc=2760829715714827204377171e8e9dc3a79185500e46805511b2c277adf1fb; acw_sc__v3=5daaec608ce6c5ba1fab0c4137c00ecb0cd34525; uuid_tt_dd=10_2450623130-1571482720624-229726; dc_session_id=10_1571482720624.999633; acw_sc__v2=5daaec6067f5ec51b728d2bd7660bf7372ed8903; TY_SESSION_ID=c82ca68f-e408-4c15-b681-71da67f637c2; dc_tos=pzmbtt; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1571482722; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1571482722; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_2450623130-1571482720624-229726; c-login-auto=1; announcement=%257B%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblogdev.blog.csdn.net%252Farticle%252Fdetails%252F102605809%2522%252C%2522announcementCount%2522%253A1%252C%2522announcementExpire%2522%253A527116621%257D +Host: blog.csdn.net +Referer: https://blog.csdn.net/ +Sec-Fetch-Mode: navigate +Sec-Fetch-Site: same-origin +Sec-Fetch-User: ?1 +Upgrade-Insecure-Requests: 1 +User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36 \ No newline at end of file diff --git a/csdn.py b/csdn.py new file mode 100644 index 0000000..68692ee --- /dev/null +++ b/csdn.py @@ -0,0 +1,208 @@ +#!/usr/bin/env python +# coding: utf-8 + +import os, time, re +import requests +import threading +import logging +from bs4 import BeautifulSoup, Comment +from selenium import webdriver +from tomd import Tomd + + +def result_file(folder_name, file_name): + folder = os.path.join(os.path.dirname(os.path.realpath(__file__)), "articles", folder_name) + if not os.path.exists(folder): + os.makedirs(folder) + path = os.path.join(folder, file_name) + file = open(path,"w") + file.close() + else: + path = os.path.join(folder, file_name) + return path + + 
+def get_headers(cookie_path:str): + cookies = {} + with open(cookie_path, "r", encoding="utf-8") as f: + cookie_list = f.readlines() + for line in cookie_list: + cookie = line.split(":") + cookies[cookie[0]] = str(cookie[1]).strip() + return cookies + + +def delete_ele(soup:BeautifulSoup, tags:list): + for ele in tags: + for useless_tag in soup.select(ele): + useless_tag.decompose() + + +def delete_ele_attr(soup:BeautifulSoup, attrs:list): + for attr in attrs: + for useless_attr in soup.find_all(): + del useless_attr[attr] + + +def delete_blank_ele(soup:BeautifulSoup, eles_except:list): + for useless_attr in soup.find_all(): + try: + if useless_attr.name not in eles_except and useless_attr.text == "": + useless_attr.decompose() + except Exception: + pass + + +class TaskQueue(object): + def __init__(self): + self.VisitedList = [] + self.UnVisitedList = [] + + def getVisitedList(self): + return self.VisitedList + + def getUnVisitedList(self): + return self.UnVisitedList + + def InsertVisitedList(self, url): + if url not in self.VisitedList: + self.VisitedList.append(url) + + def InsertUnVisitedList(self, url): + if url not in self.UnVisitedList: + self.UnVisitedList.append(url) + + def RemoveVisitedList(self, url): + self.VisitedList.remove(url) + + def PopUnVisitedList(self,index=0): + url = "" + if index and self.UnVisitedList: + url = self.UnVisitedList[index] + del self.UnVisitedList[:index] + elif self.UnVisitedList: + url = self.UnVisitedList.pop() + return url + + def getUnVisitedListLength(self): + return len(self.UnVisitedList) + + +class Article(object): + def __init__(self): + self.options = webdriver.ChromeOptions() + self.options.add_experimental_option('excludeSwitches', ['enable-logging']) + self.options.add_argument('headless') + self.browser = webdriver.Chrome(options=self.options) + # 设置全局智能等待时间 + self.browser.implicitly_wait(30) + + def get_content(self, url): + self.browser.get(url) + try: + 
self.browser.find_element_by_xpath('//a[@class="btn-readmore"]').click() + except Exception: + pass + content = self.browser.find_element_by_xpath('//div[@id="content_views"]').get_attribute("innerHTML") + return content + + def get_md(self, url): + """ + 转换为markdown格式 + """ + content = self.get_content(url) + soup = BeautifulSoup(content, 'lxml') + # 删除注释 + for useless_tag in soup(text=lambda text: isinstance(text, Comment)): + useless_tag.extract() + # 删除无用标签 + tags = ["svg", "ul", ".hljs-button.signin"] + delete_ele(soup, tags) + # 删除标签属性 + attrs = ["class", "name", "id", "onclick", "style", "data-token", "rel"] + delete_ele_attr(soup,attrs) + # 删除空白标签 + eles_except = ["img", "br", "hr"] + delete_blank_ele(soup, eles_except) + # 转换为markdown + md = Tomd(str(soup)).markdown + return md + + +class CSDN(object): + def __init__(self, cookie_path): + self.headers = get_headers(cookie_path) + self.TaskQueue = TaskQueue() + + def get_articles(self, username:str): + """获取文章标题和链接""" + num = 0 + while True: + num += 1 + url = u'https://blog.csdn.net/' + username + '/article/list/' + str(num) + response = requests.get(url=url, headers=self.headers) + html = response.text + soup = BeautifulSoup(html, "html.parser") + articles = soup.find_all('div', attrs={"class":"article-item-box csdn-tracking-statistics"}) + if len(articles) > 0: + for article in articles: + article_title = article.a.text.strip().replace(' ',':') + article_href = article.a['href'] + yield article_title,article_href + else: + break + + def write_articals(self, username:str): + """将博文写入本地""" + print("[++] 正在爬取 {} 的博文......".format(username)) + artical = Article() + reademe_path = result_file(username,file_name="README.md") + with open(reademe_path,'w', encoding='utf-8') as reademe_file: + i = 1 + readme_head = "# " + username + " 的博文\n" + reademe_file.write(readme_head) + for article_title,article_href in self.get_articles(username): + print("[++++] {}. 
正在处理URL:{}".format(str(i), article_href)) + text = str(i) + '. [' + article_title + ']('+ article_href +')\n' + reademe_file.write(text) + file_name = str(i) + "." + re.sub(r'[\/::*?"<>|]','-', article_title) + ".md" + artical_path = result_file(folder_name=username, file_name=file_name) + md_content = artical.get_md(article_href) + md_head = "# " + str(i) + "." + article_title + "\n" + md = md_head + md_content + with open(artical_path, "w", encoding="utf-8") as artical_file: + artical_file.write(md) + i += 1 + time.sleep(2) + + def spider(self): + """将爬取到的文章保存到本地""" + while True: + if self.TaskQueue.getUnVisitedListLength(): + username = self.TaskQueue.PopUnVisitedList() + self.write_articals(username) + + def check_user(self, user_path:str): + with open(user_path, 'r', encoding='utf-8') as f: + users = f.readlines() + for user in users: + self.TaskQueue.InsertUnVisitedList(user.strip()) + + def run(self, user_path): + UserThread = threading.Thread(target=self.check_user, args=(user_path,)) + SpiderThread = threading.Thread(target=self.spider, args=()) + UserThread.start() + SpiderThread.start() + UserThread.join() + SpiderThread.join() + + +def main(): + user_path = 'username.txt' + csdn = CSDN('cookie.txt') + csdn.run(user_path) + + +if __name__ == "__main__": + main() + diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..861958b --- /dev/null +++ b/requirements.txt @@ -0,0 +1,3 @@ +bs4==0.0.1 +selenium==3.141.0 +requests==2.22.0 \ No newline at end of file diff --git a/tomd.py b/tomd.py new file mode 100644 index 0000000..db4f893 --- /dev/null +++ b/tomd.py @@ -0,0 +1,155 @@ +import re + +__all__ = ['Tomd', 'convert'] + +MARKDOWN = { + 'h1': ('\n# ', '\n'), + 'h2': ('\n## ', '\n'), + 'h3': ('\n### ', '\n'), + 'h4': ('\n#### ', '\n'), + 'h5': ('\n##### ', '\n'), + 'h6': ('\n###### ', '\n'), + 'code': ('`', '`'), + 'ul': ('', ''), + 'ol': ('', ''), + 'li': ('- ', ''), + 'blockquote': ('\n> ', '\n'), + 'em': ('**', '**'), + 
'strong': ('**', '**'), + 'block_code': ('\n```\n', '\n```\n'), + 'span': ('', ''), + 'p': ('\n', '\n'), + 'p_with_out_class': ('\n', '\n'), + 'inline_p': ('', ''), + 'inline_p_with_out_class': ('', ''), + 'b': ('**', '**'), + 'i': ('*', '*'), + 'del': ('~~', '~~'), + 'hr': ('\n---', '\n\n'), + 'thead': ('\n', '|------\n'), + 'tbody': ('\n', '\n'), + 'td': ('|', ''), + 'th': ('|', ''), + 'tr': ('', '\n') +} + +BlOCK_ELEMENTS = { + 'h1': '(.*?)', + 'h2': '(.*?)', + 'h3': '(.*?)', + 'h4': '(.*?)', + 'h5': '(.*?)', + 'h6': '(.*?)', + 'hr': '
', + 'blockquote': '(.*?)', + 'ul': '(.*?)', + 'ol': '(.*?)', + 'block_code': '(.*?)', + 'p': '(.*?)

', + 'p_with_out_class': '


', + 'thead': '(.*?)', + 'tr': '(.*?)' +} + +INLINE_ELEMENTS = { + 'td': '(.*?)', + 'tr': '(.*?)', + 'th': '(.*?)', + 'b': '(.*?)', + 'i': '(.*?)', + 'del': '(.*?)', + 'inline_p': '(.*?)

', + 'inline_p_with_out_class': '


', + 'code': '(.*?)', + 'span': '(.*?)', + 'ul': '(.*?)', + 'ol': '(.*?)', + 'li': '(.*?)', + 'img': '(.*?)', + 'a': '(.*?)', + 'em': '(.*?)', + 'strong': '(.*?)' +} + +DELETE_ELEMENTS = ['', '', '', ''] + + +class Element: + def __init__(self, start_pos, end_pos, content, tag, is_block=False): + self.start_pos = start_pos + self.end_pos = end_pos + self.content = content + self._elements = [] + self.is_block = is_block + self.tag = tag + self._result = None + + if self.is_block: + self.parse_inline() + + def __str__(self): + wrapper = MARKDOWN.get(self.tag) + self._result = '{}{}{}'.format(wrapper[0], self.content, wrapper[1]) + return self._result + + def parse_inline(self): + for tag, pattern in INLINE_ELEMENTS.items(): + + if tag == 'a': + self.content = re.sub(pattern, '[\g<2>](\g<1>)', self.content) + elif tag == 'img': + self.content = re.sub(pattern, '![\g<2>](\g<1>)', self.content) + elif self.tag == 'ul' and tag == 'li': + self.content = re.sub(pattern, '- \g<1>', self.content) + elif self.tag == 'ol' and tag == 'li': + self.content = re.sub(pattern, '1. 
\g<1>', self.content) + elif self.tag == 'thead' and tag == 'tr': + self.content = re.sub(pattern, '\g<1>\n', self.content.replace('\n', '')) + elif self.tag == 'tr' and tag == 'th': + self.content = re.sub(pattern, '|\g<1>', self.content.replace('\n', '')) + elif self.tag == 'tr' and tag == 'td': + self.content = re.sub(pattern, '|\g<1>', self.content.replace('\n', '')) + else: + wrapper = MARKDOWN.get(tag) + self.content = re.sub(pattern, '{}\g<1>{}'.format(wrapper[0], wrapper[1]), self.content) + + +class Tomd: + def __init__(self, html='', options=None): + self.html = html + self.options = options + self._markdown = '' + + def convert(self, html, options=None): + elements = [] + for tag, pattern in BlOCK_ELEMENTS.items(): + for m in re.finditer(pattern, html, re.I | re.S | re.M): + element = Element(start_pos=m.start(), + end_pos=m.end(), + content=''.join(m.groups()), + tag=tag, + is_block=True) + can_append = True + for e in elements: + if e.start_pos < m.start() and e.end_pos > m.end(): + can_append = False + elif e.start_pos > m.start() and e.end_pos < m.end(): + elements.remove(e) + if can_append: + elements.append(element) + + elements.sort(key=lambda element: element.start_pos) + self._markdown = ''.join([str(e) for e in elements]) + + for index, element in enumerate(DELETE_ELEMENTS): + self._markdown = re.sub(element, '', self._markdown) + return self._markdown + + @property + def markdown(self): + self.convert(self.html, self.options) + return self._markdown + + +_inst = Tomd() +convert = _inst.convert diff --git a/username.txt b/username.txt new file mode 100644 index 0000000..adb7063 --- /dev/null +++ b/username.txt @@ -0,0 +1 @@ +ds19991999 \ No newline at end of file