commit demo

master
ds19991999 2019-10-19 19:23:00 +08:00
parent 604709b719
commit db79d3c171
19 changed files with 1029 additions and 1 deletion

View File

@ -1 +1,62 @@
# csdn-spider
# CSDN spider script
Main features: crawl all blog posts of a given `csdn` user, convert them to `markdown`, and save them locally.
## 1. Environment
You need a `WebDriver`: download the `chrome` driver matching your local browser from https://chromedriver.chromium.org/downloads and add it to your `$PATH` environment variable.
```shell
# Python 3 is required
python3 -m pip install -r requirements.txt
```
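To verify the driver is reachable before a full run, here is a minimal sketch using the same headless `ChromeOptions` that `csdn.py` itself uses:
```python
# quick sanity check: raises if chromedriver is not on $PATH
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('headless')
browser = webdriver.Chrome(options=options)
browser.get('https://blog.csdn.net')
print(browser.title)
browser.quit()
```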
## 2. Get the script
```shell
git clone https://github.com/ds19991999/csdn-spider.git
```
## 3. Usage
### 1. Get a cookie
Log in to your `csdn` account and open https://blog.csdn.net. Press `F12` to open the developer tools, copy all of the `Request Headers`, and save them into a `cookie.txt` file.
![1571482112632](assets/1571482112632.png)
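For reference, `csdn.py` turns this file into a `requests` headers dict — one `Header: value` pair per line, splitting only on the first colon. A minimal sketch of the same parsing logic:
```python
# parse cookie.txt into a dict usable as requests headers;
# split on the first ":" only, since values (URLs, etc.) contain colons too
def parse_headers(path="cookie.txt"):
    headers = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if ":" in line:
                key, value = line.split(":", 1)
                headers[key.strip()] = value.strip()
    return headers
```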
### 2. Add the `csdn` users to crawl
Add usernames to `username.txt`, one per line.
### 3. Run the script
```shell
python3 csdn.py
```
## 4. Results
**While running**
![1571483423256](assets/1571483423256.png)
**Article index generated at** `./articles/username/README.md`
![1571483552438](assets/1571483552438.png)
**Crawled posts under** `./articles/username/`
![1571483479356](assets/1571483479356.png)
**Conversion result**
![1571483777703](assets/1571483777703.png)
## 5. LICENSE
<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
`PS`: a spider script written on a whim; updates happen when they happen.

View File

@ -0,0 +1,186 @@
# 1. (Original) Quick manual JupyterLab install on Debian with HTTPS
Quite a while ago I wrote a very detailed `Jupyter Lab` installation tutorial ([`link`](https://www.creat.kim/archives/25/)). It felt overly complicated — the `nginx` part in particular wasn't explained clearly — so a fair number of readers ended up unable to run `python`, although following that tutorial step by step does work. Also, being able to open the site over `https` doesn't guarantee code will run: `jupyter lab` communicates over `websocket`, not plain `http`. So here is a leaner version: install `Jupyter Lab` on `Debian` and serve it over `https` with `caddy` — tested, and programs actually run. This tutorial covers only the `Python 2` kernel; to install `Python 3` alongside it, see [`this post`](https://www.creat.kim/archives/25/). The steps below are kept brief so you can get through them quickly, in one pass. demo: [https://jupyter.creat.kim](https://jupyter.creat.kim)<br/>
<img alt="" src="http://image.creat.kim/picgo/20190326142651.png"/><br/>
<img alt="" src="http://image.creat.kim/picgo/20190326151655.png"/>
Preparation:
```
sudo apt-get install software-properties-common
```
## Install the `Python` environment
```
sudo apt-get install python-pip python-dev build-essential
sudo pip install --upgrade pip
sudo pip install --upgrade virtualenv
sudo apt-get install python-setuptools python-dev build-essential
sudo easy_install pip
sudo pip install --upgrade virtualenv
sudo apt-get install python3-pip
sudo apt-get install python-pip
sudo pip3 install --upgrade pip
sudo pip2 install --upgrade pip
sudo pip install --upgrade pip
```
## Check where `pip` points
```
~ $ which pip
/usr/local/bin/pip
~ $ which pip2
/usr/local/bin/pip2
~ $ which pip3
/usr/local/bin/pip3
```
## Install `yarn`
```
curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
sudo apt-get update
sudo apt-get install yarn
```
## Install `nodejs`
```
curl -sL https://deb.nodesource.com/setup_10.x | bash -
apt-get install -y nodejs
```
## Install `jupyterlab`
```
sudo pip2 install jupyterlab
```
## Configure `jupyterlab`
```
jupyter-notebook password
```
Enter `ipython` and generate a hashed password. The password you type here is the one you will use to log in to `jupyter lab`; write down the generated hash.
```
ipython
from notebook.auth import passwd
passwd()
# type the password you will use to log in to the JupyterLab UI
# it prints a hash like the one below; write it down for later
'sha1:b92f3fb7d848:a5d40ab2e26aa3b296ae1faa17aa34d3df351704'
```
## Edit the configuration file
It is usually at `/root/.jupyter/jupyter_notebook_config.py`; find and change the following options.
```
c.NotebookApp.allow_root = True
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.notebook_dir = u'/root/JupyterLab'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:b92f3fb7d848:a5d40ab2e26aa3b296ae1faa17aa34d3df351704'
c.NotebookApp.port = 8888
# Meaning of the options above:
# allow running jupyterlab as root
# allow access from any IP range
# the root directory served by jupyterlab
# don't open a browser on startup (a server only has a terminal anyway)
# the hashed password generated earlier
# the access port; it must match the caddy config below
```
## Run `Jupyter Lab`
```
jupyter-lab --version
jupyter lab build
mkdir ~/JupyterLab
cd ~/JupyterLab
# screen keeps it running in the background
apt install screen
screen -S jupterlab
jupyter lab
```
Press `ctrl+A+D` to detach from the screen session.
## HTTPS reverse proxy with `caddy`
Replace the domain with your own; for detailed `caddy` usage see: [`[link]`](https://www.creat.kim/archives/18/)
```
wget -N --no-check-certificate https://raw.githubusercontent.com/ds19991999/shell.sh/shell/caddy_install.sh && chmod +x caddy_install.sh && bash caddy_install.sh
echo "jupyter.creat.kim
gzip
tls cva.engineer.ding@gmail.com
proxy / 127.0.0.1:8888 {
transparent
websocket
}" &gt; /usr/local/caddy/Caddyfile
```
## Scheduled backups to `GitHub`
See this detailed write-up: [`[link]`](https://www.moerats.com/archives/858/). A minimal sketch of the idea follows.
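Assuming `/root/JupyterLab` is already a git repository with an authenticated GitHub remote (an assumption — adapt the path and remote to your setup), a commit-and-push script you could schedule with `cron` might look like this:
```
#!/usr/bin/env python3
# minimal backup sketch: stage, commit and push the notebook directory;
# assumes /root/JupyterLab is a git repo with an authenticated remote
import subprocess
import time

REPO = "/root/JupyterLab"

subprocess.run(["git", "-C", REPO, "add", "-A"], check=True)
# the commit may fail harmlessly when nothing changed, hence check=False
subprocess.run(["git", "-C", REPO, "commit", "-m",
                "backup " + time.strftime("%Y-%m-%d %H:%M")], check=False)
subprocess.run(["git", "-C", REPO, "push"], check=True)
```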
## Configure the `python2` and `python3` kernels
One more thing, since plenty of people trip here… when installing packages, never run a bare `pip3 install ***` or `pip2 install ***`; call pip through the interpreter instead:
```
python2 -m pip install ipykernel ipython matplotlib scipy pandas numpy
python3 -m pip install ipykernel ipython matplotlib scipy pandas numpy
```
Check the kernels (if one is missing, registering it with `python2 -m ipykernel install` or `python3 -m ipykernel install` usually fixes it):
```
root@google:~/JupyterLab# jupyter kernelspec list
Available kernels:
python2 /usr/local/share/jupyter/kernels/python2
python3 /usr/local/share/jupyter/kernels/python3
```
Done — open your domain and start using it.
---
## A few parting thoughts
This is probably the last post I'll publish on `CSDN`. It originally appeared at [https://www.creat.kim/archives/40/](https://www.creat.kim/archives/40/) — yes, I've finally left the public blogging platforms. I wrote on `CSDN` for about a year and a half: `107` posts, of which `97` are "original" (ahem, borrowed), `7` reposts, `2` private, and `1` set private by an admin for violating some policy… `CSDN` rank `10k+`, `225k+` views, `48` followers. An unremarkable record with unremarkable writing — probably representative of most people here.
Chinese blogging platforms are actually all decent, and writing on `CSDN` is a pleasant experience. For a while I kept wavering between my own blog and the public platforms, and the original point of blogging slowly got lost; having been through that, though, I think I understand a few things now.
After trying `WordPress`, `Zhihu`, `Jianshu`, `Cnblogs`, `Sina`, `GitHub-Jekyll`, `coding-jekyll`, `hexo`, `Typecho`… I picked up some basics of running a website — at the very least, that anything hosted in China needs an ICP filing…<br/>
For image hosting, I went from plain copy-and-paste to `GitHub`+`PicGo`, `UPYUN` (filing required), `Qiniu` (filing required), and self-hosted image beds… and learned a few `CDN` acceleration tricks along the way…<br/>
For documents, I went from editing in place to `CSDN`'s `MarkDown` editor, `Youdao Note`, `Evernote` (separate Chinese and international editions), `GitHub-README`, `GitBook`, `MkDoc`, `Read the Docs`, `Sphinx`, `Docsify` — practice makes perfect: once fluent, you can make any document look good, even though I still can't use `Vim`…<br/>
On servers, I learned a lot about the gap between domestic and foreign hosting, and grew to hate watching an `install` of a package or `program` crawl along at a few `k` or `b` per second. Swap domestic mirrors however you like — they never match the foreign ones. Some sites aren't even blocked, yet the local speed is unbearable; I'm amazed I ever endured that snail's pace. Only after seeing the other side can you look at a problem from another angle — better than a diet of pre-filtered information.
Then there are the education benefits abroad. Some say foreign providers got burned by Chinese freeloaders and stopped offering education benefits to China. But look at the education deals from the big domestic vendors — servers so cheap even I was tempted, so off I went to register with every one of them. Real-name verification required? Fine, I verify, I upload a photo. A filing required? What, a filing too? Fine, I file, I upload another photo, and there goes a week. And now there's monitoring on top of it? I give up… Doesn't it feel a bit like a predatory loan — hand over your ID card and a nice photo, and in exchange you get a cheap server (okay, that's an exaggeration, haha). Not long ago Google, too, started demanding photo verification for sign-ups from Chinese IPs — for China alone. We really ought to learn from how other countries invest in education…
After the `12306` incident, the `Lantern` incident, and assorted `database leaks`, the true and the false blur into each other. Living here, you have little choice but to trade privacy for convenience.

View File

@ -0,0 +1,33 @@
# 2. (Original) Fixing "Waiting for headers" and 404 errors when updating a fresh Aliyun ("Tricky Cloud") Debian machine
First remove the leftover Aliyun pip and apt configuration:
```
rm -rf /root/.pip /root/.pydistutils.cfg /etc/apt/sources.list.d/sources-aliyun-0.list /etc/apt/sources.list.d/sources-aliyun* /var/lib/apt/lists/*
```
Then rewrite `/etc/apt/sources.list` with the following entries:
```
deb http://mirrors.cloud.aliyuncs.com/debian/ jessie main contrib non-free
deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie main contrib non-free
deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-proposed-updates main non-free contrib
deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-proposed-updates main non-free contrib
deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-updates main contrib non-free
deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-updates main contrib non-free
## Uncomment the following two lines to add software from the 'backports'
## repository.
##
## N.B. software from this repository may not have been tested as
## extensively as that contained in the main release, although it includes
## newer versions of some applications which may provide useful features.
#deb http://mirrors.cloud.aliyuncs.com/debian/ jessie-backports main contrib non-free
#deb-src http://mirrors.cloud.aliyuncs.com/debian/ jessie-backports main contrib non-free
```
Finally clean the apt lists and update again:
```
apt-get clean
apt-get update
```
Tricky Cloud will be Tricky Cloud. Unbelievable!!!

View File

@ -0,0 +1,3 @@
# 3. (Original) Deploying a Netlify CMS backend for a Jekyll blog
### Table of contents

View File

@ -0,0 +1,71 @@
# 4. (Original) Applying for a Let's Encrypt wildcard certificate
> github: [https://github.com/Neilpang/acme.sh](https://github.com/Neilpang/acme.sh)
Let's Encrypt certificates requested through acme.sh support many DNS providers; the ones most used in China include `cloudxns`, `dnspod`, `aliyun`, `cloudflare`, `linode`, `he`, `digitalocean`, `namesilo`, `aws`, `namecom`, `freedns`, `godaddy`, `yandex`, and more.
### Table of contents
## Install acme.sh
```
curl https://get.acme.sh | sh
```
`acme.sh` is installed into `~/.acme.sh`. Create a `bash` `alias` to make it convenient: `alias acme.sh=~/.acme.sh/acme.sh`
Certificates issued through `acme.sh` get a `cronjob` created automatically: every day at 0:00 it checks all certificates and renews any that are close to expiring.
## Verifying domain ownership via DNS
```
acme.sh --issue --dns -d mydomain.com
```
`acme.sh` generates and prints the required DNS record; add that `txt` record in your domain management panel and you're done.
## Getting a `DNS API`
With your DNS provider's `DNS API`, the `api` adds the `txt` record above to the DNS provider automatically. For example, Alibaba's `api`: [https://ak-console.aliyun.com/#/accesskey](https://ak-console.aliyun.com/#/accesskey) — then configure it following the instructions at [https://github.com/Neilpang/acme.sh/tree/master/dnsapi](https://github.com/Neilpang/acme.sh/tree/master/dnsapi). For Alibaba it is:
```
export Ali_Key="sdfsdfsdfljlbjkljlkjsdfoiwje"
export Ali_Secret="jlsdflanljkljlfdsaklkjflsa"
acme.sh --issue --dns dns_ali -d example.com -d *.example.com
```
The `*` is the wildcard part. After one run, Ali_Key and Ali_Secret are saved in `~/.acme.sh/account.conf`, and the issued SSL certificate lands in `~/.acme.sh/example.com`.
## Installing the certificate
> See: [copy/install the certificate](https://github.com/Neilpang/acme.sh/wiki/%E8%AF%B4%E6%98%8E#3-copy%E5%AE%89%E8%A3%85-%E8%AF%81%E4%B9%A6)
Use the `--installcert` command and specify the target locations; the certificate files are then copied there, for example:
```
acme.sh --installcert -d <domain>.com \
--key-file /etc/nginx/ssl/<domain>.key \
--fullchain-file /etc/nginx/ssl/fullchain.cer \
--reloadcmd "service nginx force-reload"
```
BT panel (宝塔) users: under the SSL tab choose "Other certificate" and paste the certificate contents there.<br/>
<img alt="" src="http://image.creat.kim/picgo/20190314132922.png"/><br/>
Adjust the certificate path here:<br/>
<img alt="" src="http://image.creat.kim/picgo/20190314132617.png"/><br/>
Certificates currently renew automatically after 60 days; you don't need to do anything. This interval may be shortened in the future, but it stays automatic — not your concern.
## Updating `acme.sh`
Upgrade automatically: `acme.sh --upgrade --auto-upgrade`<br/>
Turn auto-upgrade off: `acme.sh --upgrade --auto-upgrade 0`
If anything goes wrong, see the [wiki](https://github.com/Neilpang/acme.sh/wiki) and the [debug guide](https://github.com/Neilpang/acme.sh/wiki/How-to-debug-acme.sh).

View File

@ -0,0 +1,181 @@
# 5. (Original) Rclone notes
### Table of contents
## Basic commands
### Mounting
```
# mount on windows
rclone mount OD:/ H: --cache-dir E:\ODPATH --vfs-cache-mode writes &
# mount on linux
nohup rclone mount GD:/ /root/GDPATH --copy-links --no-gzip-encoding --no-check-certificate --allow-other --allow-non-empty --umask 000 &
# unmount (works on any linux)
fusermount -qzu /root/GDPATH
# or:
fusermount -u /path/to/local/mount
# unmount on windows
umount /path/to/local/mount
```
### rclone commands
```
rclone ls
e.g. rclone ls remote:path [flags]
ls       # recursively list all files in remote with their sizes, a bit like tree
lsl      # recursively list all files with size and modification time
lsd      # list only directories, with modification time and number of contained files
lsf      # list files and directories at the current level only
lsjson   # list files and directories in JSON format
rclone copy
e.g. rclone copy OD:/SOME/PATH GD:/OTHER/PATH
--no-traverse    # speeds things up when /path/to/src holds many files but few change per day
-P               # show live transfer statistics
--max-age 24h    # only transfer files modified within the last 24 hours (off by default)
rclone copy --max-age 24h --no-traverse /path/to/src remote:/PATH -P
rclone sync
e.g. rclone sync source:path dest:path [flags]
# run with --dry-run first to see exactly what will be copied and deleted
rclone delete
# list files larger than 100M
rclone --min-size 100M lsl remote:path
# dry-run the deletion
rclone --dry-run --min-size 100M delete remote:path
# delete for real
rclone --min-size 100M delete remote:path
# remove a path and all of its contents (filters are ignored here, unlike delete)
rclone purge
# remove an empty path
rclone rmdir
# remove all empty directories under a path
rclone rmdirs
# move files
rclone move
# delete empty source directories after the move
--delete-empty-src-dirs
# check that the files in source and destination match
rclone check
# download both sides and compare on the fly instead of comparing hashes
--download
rclone md5sum
# produce an md5sum file for all files in the path
rclone sha1sum
# produce a sha1sum file for all files in the path
rclone size
# print the total size and number of files under remote:path
--json   # JSON output
rclone version --check   # check for a newer version
rclone cleanup   # empty the remote's trash / remove old file versions
rclone dedupe    # interactively find duplicate files and delete/rename them
--dedupe-mode newest   # keep only the newest of identical files, non-interactive
rclone cat
# same as on linux
rclone copyto
# copy a file from source to dest, skipping files already copied
rclone gendocs output_directory [flags]
# generate rclone's documentation
rclone listremotes   # list all remotes from the config file
--long   # show type and name; by default only names are shown
rclone moveto
# does not transfer unchanged files
rclone cryptcheck /path/to/files encryptedremote:path
# check the integrity of an encrypted remote
rclone about
# show the remote's quota, e.g.:
$ rclone about ODA1P1:
Total:   5T
Used:    284.885G
Free:    4.668T
Trashed: 43.141G
--json   # JSON output
rclone mount   # the mount command
# on Windows this additionally requires winfsp
--vfs-cache-mode   # without it, files can only be written sequentially and only sought while reading, so Windows programs cannot edit them; the flag turns on caching
# four modes: off|minimal|writes|full — higher modes make rclone cache more, at the cost of disk space (default: off)
--vfs-cache-max-age 24h    # cache files modified within the last 24 hours
--vfs-cache-max-size 10g   # cap the total cache at 10g (it may be exceeded)
--cache-dir                # where to keep the cache
--umask int                # override filesystem permissions
--allow-non-empty          # allow mounting over a non-empty directory
--allow-other              # allow other users to access the mount
--no-check-certificate     # do not verify the server's SSL certificate
--no-gzip-encoding         # do not request gzip encoding
```
## Using your own api for gd transfers
> See this detailed post: [https://www.moerats.com/archives/877/](https://www.moerats.com/archives/877/)
Too many people use `rclone`, and that creates a problem: we all share the same `client_id`, so at peak times you hit `403`s, or run into `Limitations` before ever reaching the `750G` quota. Anyone who transfers Google Drive files with `rclone` heavily needs their own `api`. Obtain a Google API client `ID` and client secret as described in the post above; when `rclone config` prompts for them, paste them in and you're done.
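When `rclone config` completes, the remote section it writes to `~/.config/rclone/rclone.conf` looks roughly like this — a sketch with placeholder values (the remote name `GD`, the ids, and the token are illustrative, not real):
```
[GD]
type = drive
client_id = xxxxxxxx.apps.googleusercontent.com
client_secret = xxxxxxxx
scope = drive
token = {"access_token":"xxxx","token_type":"Bearer","refresh_token":"xxxx","expiry":"2019-01-01T00:00:00Z"}
```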
The mount command then becomes:
```
# these flags are mainly for uploading
/usr/bin/rclone mount DriveName:Folder LocalFolder \
--umask 0000 \
--default-permissions \
--allow-non-empty \
--allow-other \
--transfers 4 \
--buffer-size 32M \
--low-level-retries 200
# if the mount is also read from (e.g. streaming via H5ai), consider adding these 3 flags too (same format as above)
--dir-cache-time 12h
--vfs-read-chunk-size 32M
--vfs-read-chunk-size-limit 1G
```
## Getting past Google Drive's server-side 750g limit
Google officially limits third-party-`api` transfers to `750G` per day. That `750G` applies to copies performed directly on Google's servers, where the file never touches your client. Uploads that pass through the client to `gd` don't count against server-side transfers and carry their own `750G` limit — so the total daily allowance is `1.5T`:
```
# normal usage: server-side API, no local traffic
rclone copy GD1:/PATH GD2:/PATH
# disable server-side copies: client API, traffic flows through the client
rclone copy --disable copy GD1:/PATH GD2:/PATH
```
That makes `1.5T` per day.
## Google Docs limits
In `rclone ls`, Google Docs show a size of `-1`, while at the `VFS` layer — files seen through `rclone mount` or `rclone serve` — they show `0`. Commands like `rclone sync` and `rclone copy` simply ignore the document size and operate anyway. In other words, you can't tell how big a Google Doc is until you download it — which hardly matters…

View File

@ -0,0 +1,7 @@
# 6. (Reposted) Changing the Office 365 for PC update channel
Office 365 for PC defaults to the semi-annual update channel; you can switch to the monthly channel (or another one) to try the newest features.
> Original link: [https://www.mr-technos.com/forum.php?mod=viewthread&tid=79](https://www.mr-technos.com/forum.php?mod=viewthread&tid=79)

View File

@ -0,0 +1,91 @@
# 7. (Original) A solution for moving Baidu Pan files to gd/od
**Home page:** [HomePage](https://telegra.ph/HomePage-01-03)<br/>[https://telegra.ph/Fuck-PanBaidu-02-19](https://telegra.ph/Fuck-PanBaidu-02-19)<br/>[https://graph.org/Fuck-PanBaidu-02-19](https://graph.org/Fuck-PanBaidu-02-19)
### 1. Install aria2
```
wget -N https://git.io/aria2.sh && chmod +x aria2.sh && bash aria2.sh
```
Start: /etc/init.d/aria2 start
Stop: /etc/init.d/aria2 stop
Restart: /etc/init.d/aria2 restart
Status: /etc/init.d/aria2 status
Config file: /root/.aria2/aria2.conf (contains Chinese comments; some systems may not display them)
RPC token: randomly generated (can be changed in the config file)
Default download directory: /root/Download
### 2. Offline downloads to gd/od with aria2
1. Install rclone:
```
curl https://rclone.org/install.sh | sudo bash
```
rclone configuration is covered at [https://rclone.org/drive/](https://rclone.org/drive/)
2. Edit the script **/root/.aria2/autoupload.sh**:
```
name='Onedrive'            # the remote name chosen during rclone config
folder='/DRIVEX/Download'  # destination folder in the drive; leave empty for the root
```
3. In the aria2 config file **/root/.aria2/aria2.conf**, enable the on-download-complete hook:
```
# call rclone to upload (move) finished files to the drive
on-download-complete=/root/.aria2/autoupload.sh
```
4. Restart aria2:
```
/root/aria2.sh              # choose option 6 to restart
# or: service aria2 restart
```
5. Download files through an aria2 web frontend such as [aria2.ml](http://aria2.ml/).
Fill in your VPS's aria2 connection details.
Click "New", paste the download links, and start the download.
Finished downloads are uploaded to gd/od automatically.
### 3. Using a third-party Baidu Pan client
SpeedPan (速盘) is recommended here; unfortunately PanDownload does not expose aria2 settings.
Change the download directory as shown; if the GUI won't let you, quit the program and edit config.ini directly.
The download directory must match the aria2 configuration on the remote server — for an aria2 installed as above, that is **/root/Download**.
With that, your Baidu Pan files download straight into gd/od.
### 4. Screenshots
1. Downloading files to the VPS through the AriaNG frontend, with **autoupload.sh pushing the movie offline into gd**
2. Using SpeedPan's remote-aria2 feature to pull Baidu Pan files onto the VPS, where **autoupload.sh then moves them into gd automatically**

View File

@ -0,0 +1,15 @@
# Posts by ds19991999
1. [(Original) Quick manual JupyterLab install on Debian with HTTPS](https://blog.csdn.net/ds19991999/article/details/88935996)
2. [(Original) Fixing "Waiting for headers" and 404 errors when updating a fresh Aliyun Debian machine](https://blog.csdn.net/ds19991999/article/details/88659452)
3. [(Original) Deploying a Netlify CMS backend for a Jekyll blog](https://blog.csdn.net/ds19991999/article/details/88651187)
4. [(Original) Applying for a Let's Encrypt wildcard certificate](https://blog.csdn.net/ds19991999/article/details/88553810)
5. [(Original) Rclone notes](https://blog.csdn.net/ds19991999/article/details/88370053)
6. [(Reposted) Changing the Office 365 for PC update channel](https://blog.csdn.net/ds19991999/article/details/87973325)
7. [(Original) A solution for moving Baidu Pan files to gd/od](https://blog.csdn.net/ds19991999/article/details/87736377)
8. [(Original) Mounting OneDrive via WebDav](https://blog.csdn.net/ds19991999/article/details/86506042)
9. [(Original) SMS verification-code platforms](https://blog.csdn.net/ds19991999/article/details/86505762)
10. [(Original) A custom friend-links sidebar for CSDN](https://blog.csdn.net/ds19991999/article/details/86505686)
11. [(Original) Shared resources](https://blog.csdn.net/ds19991999/article/details/85225611)
12. [(Original) Mounting OneDrive as a local drive on Windows](https://blog.csdn.net/ds19991999/article/details/85008885)
13. [(Original) Everyday Ubuntu usage](https://blog.csdn.net/ds19991999/article/details/83719417)
14. [(Original) Fixing Ubuntu's network problems for good — blazing speeds](https://blog.csdn.net/ds19991999/article/details/83715489)

BIN
assets/1571482112632.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 440 KiB

BIN
assets/1571483423256.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 19 KiB

BIN
assets/1571483479356.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

BIN
assets/1571483552438.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 44 KiB

BIN
assets/1571483777703.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 94 KiB

13
cookie.txt Normal file
View File

@ -0,0 +1,13 @@
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: acw_tc=2760829715714827204377171e8e9dc3a79185500e46805511b2c277adf1fb; acw_sc__v3=5daaec608ce6c5ba1fab0c4137c00ecb0cd34525; uuid_tt_dd=10_2450623130-1571482720624-229726; dc_session_id=10_1571482720624.999633; acw_sc__v2=5daaec6067f5ec51b728d2bd7660bf7372ed8903; TY_SESSION_ID=c82ca68f-e408-4c15-b681-71da67f637c2; dc_tos=pzmbtt; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1571482722; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1571482722; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_2450623130-1571482720624-229726; c-login-auto=1; announcement=%257B%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblogdev.blog.csdn.net%252Farticle%252Fdetails%252F102605809%2522%252C%2522announcementCount%2522%253A1%252C%2522announcementExpire%2522%253A527116621%257D
Host: blog.csdn.net
Referer: https://blog.csdn.net/
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36

208
csdn.py Normal file
View File

@ -0,0 +1,208 @@
#!/usr/bin/env python
# coding: utf-8
import os, time, re
import requests
import threading
import logging
from bs4 import BeautifulSoup, Comment
from selenium import webdriver
from tomd import Tomd


def result_file(folder_name, file_name):
    """Create articles/<folder_name>/<file_name> if needed and return its path."""
    folder = os.path.join(os.path.dirname(os.path.realpath(__file__)), "articles", folder_name)
    if not os.path.exists(folder):
        os.makedirs(folder)
        path = os.path.join(folder, file_name)
        file = open(path, "w")
        file.close()
    else:
        path = os.path.join(folder, file_name)
    return path


def get_headers(cookie_path: str):
    """Parse cookie.txt into a headers dict for requests."""
    cookies = {}
    with open(cookie_path, "r", encoding="utf-8") as f:
        cookie_list = f.readlines()
        for line in cookie_list:
            # split on the first ":" only, so header values containing colons survive
            cookie = line.split(":", 1)
            cookies[cookie[0]] = str(cookie[1]).strip()
    return cookies


def delete_ele(soup: BeautifulSoup, tags: list):
    """Remove every element matching the given CSS selectors."""
    for ele in tags:
        for useless_tag in soup.select(ele):
            useless_tag.decompose()


def delete_ele_attr(soup: BeautifulSoup, attrs: list):
    """Strip the given attributes from all elements."""
    for attr in attrs:
        for useless_attr in soup.find_all():
            del useless_attr[attr]


def delete_blank_ele(soup: BeautifulSoup, eles_except: list):
    """Remove empty elements, except for the whitelisted tag names."""
    for useless_attr in soup.find_all():
        try:
            if useless_attr.name not in eles_except and useless_attr.text == "":
                useless_attr.decompose()
        except Exception:
            pass


class TaskQueue(object):
    def __init__(self):
        self.VisitedList = []
        self.UnVisitedList = []

    def getVisitedList(self):
        return self.VisitedList

    def getUnVisitedList(self):
        return self.UnVisitedList

    def InsertVisitedList(self, url):
        if url not in self.VisitedList:
            self.VisitedList.append(url)

    def InsertUnVisitedList(self, url):
        if url not in self.UnVisitedList:
            self.UnVisitedList.append(url)

    def RemoveVisitedList(self, url):
        self.VisitedList.remove(url)

    def PopUnVisitedList(self, index=0):
        url = ""
        if index and self.UnVisitedList:
            url = self.UnVisitedList[index]
            del self.UnVisitedList[:index]
        elif self.UnVisitedList:
            url = self.UnVisitedList.pop()
        return url

    def getUnVisitedListLength(self):
        return len(self.UnVisitedList)


class Article(object):
    def __init__(self):
        self.options = webdriver.ChromeOptions()
        self.options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.options.add_argument('headless')
        self.browser = webdriver.Chrome(options=self.options)
        # global implicit wait
        self.browser.implicitly_wait(30)

    def get_content(self, url):
        self.browser.get(url)
        try:
            self.browser.find_element_by_xpath('//a[@class="btn-readmore"]').click()
        except Exception:
            pass
        content = self.browser.find_element_by_xpath('//div[@id="content_views"]').get_attribute("innerHTML")
        return content

    def get_md(self, url):
        """Convert an article to markdown."""
        content = self.get_content(url)
        soup = BeautifulSoup(content, 'lxml')
        # remove HTML comments
        for useless_tag in soup(text=lambda text: isinstance(text, Comment)):
            useless_tag.extract()
        # remove useless tags
        tags = ["svg", "ul", ".hljs-button.signin"]
        delete_ele(soup, tags)
        # remove tag attributes
        attrs = ["class", "name", "id", "onclick", "style", "data-token", "rel"]
        delete_ele_attr(soup, attrs)
        # remove blank tags
        eles_except = ["img", "br", "hr"]
        delete_blank_ele(soup, eles_except)
        # convert to markdown
        md = Tomd(str(soup)).markdown
        return md


class CSDN(object):
    def __init__(self, cookie_path):
        self.headers = get_headers(cookie_path)
        self.TaskQueue = TaskQueue()

    def get_articles(self, username: str):
        """Yield (title, url) for each of the user's posts, page by page."""
        num = 0
        while True:
            num += 1
            url = u'https://blog.csdn.net/' + username + '/article/list/' + str(num)
            response = requests.get(url=url, headers=self.headers)
            html = response.text
            soup = BeautifulSoup(html, "html.parser")
            articles = soup.find_all('div', attrs={"class": "article-item-box csdn-tracking-statistics"})
            if len(articles) > 0:
                for article in articles:
                    article_title = article.a.text.strip().replace(' ', '')
                    article_href = article.a['href']
                    yield article_title, article_href
            else:
                break

    def write_articals(self, username: str):
        """Write the user's posts to disk."""
        print("[++] Crawling the posts of {}......".format(username))
        artical = Article()
        reademe_path = result_file(username, file_name="README.md")
        with open(reademe_path, 'w', encoding='utf-8') as reademe_file:
            i = 1
            readme_head = "# Posts by " + username + "\n"
            reademe_file.write(readme_head)
            for article_title, article_href in self.get_articles(username):
                print("[++++] {}. Processing URL: {}".format(str(i), article_href))
                text = str(i) + '. [' + article_title + '](' + article_href + ')\n'
                reademe_file.write(text)
                file_name = str(i) + "." + re.sub(r'[\/:*?"<>|]', '-', article_title) + ".md"
                artical_path = result_file(folder_name=username, file_name=file_name)
                md_content = artical.get_md(article_href)
                md_head = "# " + str(i) + "." + article_title + "\n"
                md = md_head + md_content
                with open(artical_path, "w", encoding="utf-8") as artical_file:
                    artical_file.write(md)
                i += 1
                time.sleep(2)

    def spider(self):
        """Consume the task queue and save crawled articles to disk."""
        while True:
            if self.TaskQueue.getUnVisitedListLength():
                username = self.TaskQueue.PopUnVisitedList()
                self.write_articals(username)

    def check_user(self, user_path: str):
        with open(user_path, 'r', encoding='utf-8') as f:
            users = f.readlines()
        for user in users:
            self.TaskQueue.InsertUnVisitedList(user.strip())

    def run(self, user_path):
        UserThread = threading.Thread(target=self.check_user, args=(user_path,))
        SpiderThread = threading.Thread(target=self.spider, args=())
        UserThread.start()
        SpiderThread.start()
        UserThread.join()
        SpiderThread.join()


def main():
    user_path = 'username.txt'
    csdn = CSDN('cookie.txt')
    csdn.run(user_path)


if __name__ == "__main__":
    main()
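# Expected output layout after a run (per the README):
#   articles/<username>/README.md        - index of the user's posts
#   articles/<username>/<n>.<title>.md   - one converted post per article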

3
requirements.txt Normal file
View File

@ -0,0 +1,3 @@
bs4==0.0.1
selenium==3.141.0
requests==2.22.0
lxml  # used by BeautifulSoup(content, 'lxml') in csdn.py

155
tomd.py Normal file
View File

@ -0,0 +1,155 @@
import re

__all__ = ['Tomd', 'convert']

# tag -> (prefix, suffix) wrappers used when rendering markdown
MARKDOWN = {
    'h1': ('\n# ', '\n'),
    'h2': ('\n## ', '\n'),
    'h3': ('\n### ', '\n'),
    'h4': ('\n#### ', '\n'),
    'h5': ('\n##### ', '\n'),
    'h6': ('\n###### ', '\n'),
    'code': ('`', '`'),
    'ul': ('', ''),
    'ol': ('', ''),
    'li': ('- ', ''),
    'blockquote': ('\n> ', '\n'),
    'em': ('**', '**'),
    'strong': ('**', '**'),
    'block_code': ('\n```\n', '\n```\n'),
    'span': ('', ''),
    'p': ('\n', '\n'),
    'p_with_out_class': ('\n', '\n'),
    'inline_p': ('', ''),
    'inline_p_with_out_class': ('', ''),
    'b': ('**', '**'),
    'i': ('*', '*'),
    'del': ('~~', '~~'),
    'hr': ('\n---', '\n\n'),
    'thead': ('\n', '|------\n'),
    'tbody': ('\n', '\n'),
    'td': ('|', ''),
    'th': ('|', ''),
    'tr': ('', '\n')
}

BlOCK_ELEMENTS = {
    'h1': r'<h1.*?>(.*?)</h1>',
    'h2': r'<h2.*?>(.*?)</h2>',
    'h3': r'<h3.*?>(.*?)</h3>',
    'h4': r'<h4.*?>(.*?)</h4>',
    'h5': r'<h5.*?>(.*?)</h5>',
    'h6': r'<h6.*?>(.*?)</h6>',
    'hr': r'<hr/>',
    'blockquote': r'<blockquote.*?>(.*?)</blockquote>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'block_code': r'<pre.*?><code.*?>(.*?)</code></pre>',
    'p': r'<p\s.*?>(.*?)</p>',
    'p_with_out_class': r'<p>(.*?)</p>',
    'thead': r'<thead.*?>(.*?)</thead>',
    'tr': r'<tr>(.*?)</tr>'
}

INLINE_ELEMENTS = {
    'td': r'<td>(.*?)</td>',
    'tr': r'<tr>(.*?)</tr>',
    'th': r'<th>(.*?)</th>',
    'b': r'<b>(.*?)</b>',
    'i': r'<i>(.*?)</i>',
    'del': r'<del>(.*?)</del>',
    'inline_p': r'<p\s.*?>(.*?)</p>',
    'inline_p_with_out_class': r'<p>(.*?)</p>',
    'code': r'<code.*?>(.*?)</code>',
    'span': r'<span.*?>(.*?)</span>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'li': r'<li.*?>(.*?)</li>',
    'img': r'<img.*?src="(.*?)".*?>(.*?)</img>',
    'a': r'<a.*?href="(.*?)".*?>(.*?)</a>',
    'em': r'<em.*?>(.*?)</em>',
    'strong': r'<strong.*?>(.*?)</strong>'
}

DELETE_ELEMENTS = ['<span.*?>', '</span>', '<div.*?>', '</div>']


class Element:
    def __init__(self, start_pos, end_pos, content, tag, is_block=False):
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.content = content
        self._elements = []
        self.is_block = is_block
        self.tag = tag
        self._result = None
        if self.is_block:
            self.parse_inline()

    def __str__(self):
        wrapper = MARKDOWN.get(self.tag)
        self._result = '{}{}{}'.format(wrapper[0], self.content, wrapper[1])
        return self._result

    def parse_inline(self):
        # rewrite the inline HTML inside this block element as markdown
        for tag, pattern in INLINE_ELEMENTS.items():
            if tag == 'a':
                self.content = re.sub(pattern, r'[\g<2>](\g<1>)', self.content)
            elif tag == 'img':
                self.content = re.sub(pattern, r'![\g<2>](\g<1>)', self.content)
            elif self.tag == 'ul' and tag == 'li':
                self.content = re.sub(pattern, r'- \g<1>', self.content)
            elif self.tag == 'ol' and tag == 'li':
                self.content = re.sub(pattern, r'1. \g<1>', self.content)
            elif self.tag == 'thead' and tag == 'tr':
                self.content = re.sub(pattern, r'\g<1>\n', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'th':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'td':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            else:
                wrapper = MARKDOWN.get(tag)
                self.content = re.sub(pattern, r'{}\g<1>{}'.format(wrapper[0], wrapper[1]), self.content)


class Tomd:
    def __init__(self, html='', options=None):
        self.html = html
        self.options = options
        self._markdown = ''

    def convert(self, html, options=None):
        # collect block-level matches, dropping nested duplicates
        elements = []
        for tag, pattern in BlOCK_ELEMENTS.items():
            for m in re.finditer(pattern, html, re.I | re.S | re.M):
                element = Element(start_pos=m.start(),
                                  end_pos=m.end(),
                                  content=''.join(m.groups()),
                                  tag=tag,
                                  is_block=True)
                can_append = True
                for e in elements:
                    if e.start_pos < m.start() and e.end_pos > m.end():
                        can_append = False
                    elif e.start_pos > m.start() and e.end_pos < m.end():
                        elements.remove(e)
                if can_append:
                    elements.append(element)
        elements.sort(key=lambda element: element.start_pos)
        self._markdown = ''.join([str(e) for e in elements])
        for index, element in enumerate(DELETE_ELEMENTS):
            self._markdown = re.sub(element, '', self._markdown)
        return self._markdown

    @property
    def markdown(self):
        self.convert(self.html, self.options)
        return self._markdown


_inst = Tomd()
convert = _inst.convert
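# Usage sketch (hypothetical input; the exact output follows the regexes above):
# >>> from tomd import convert
# >>> convert('<h2>Title</h2><p>Some <b>bold</b> text</p>')
# '\n## Title\n\nSome **bold** text\n'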

1
username.txt Normal file
View File

@ -0,0 +1 @@
ds19991999