The Road to Financial Freedom

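Syncing CSDN Blog Posts to WordPress

Getting the list of article IDs to sync

Crawling the content management page automatically does not work well: only one page of data (20 articles) can be fetched, and switching to another page changes the request headers, so the content can no longer be retrieved. For that reason, only the public article list on the user's home page is crawled here, which means the list may be incomplete.

The user's home page is at https://blog.csdn.net/ followed by the username. Open the home page, inspect the page with the browser's developer tools, switch to the Network tab, and refresh. One of the requests listed there returns the article data:

Copy that request as cURL (bash) and paste it into "Convert curl commands to code" to generate the code automatically.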

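import requests

# cookies, headers and params come from the curl converter; their values are elided here
cookies = {
    ...
}

headers = {
    ...
}

params = {
    'userName': '...',
    'page': '1',
}
r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)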

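The articles may be spread over several pages; the page entry in params selects the page. Read the total article count from the first response, work out the number of pages (40 articles per page), and loop over the remaining pages: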

import math

articleId_list=[d['articleId'] for d in r1.json()['data']['articleList']]
article_count=r1.json()['data']['total']
# Page 1 is already fetched above; request pages 2..N (40 articles per page)
for p in range(2,math.ceil(article_count/40)+1):
    params = {
        'userName': '<username>',  # replace with your CSDN username
        'page': str(p),
    }
    r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)
    page_articleId_list=[d['articleId'] for d in r1.json()['data']['articleList']]
    if page_articleId_list:
        articleId_list.extend(page_articleId_list)
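
The page arithmetic can be checked on its own, without sending any requests. The sketch below assumes the page size of 40 stated in the text; page_numbers is a hypothetical helper name for illustration and is not part of any CSDN API.

```python
import math

PAGE_SIZE = 40  # page size stated in the text; an assumption about the endpoint


def page_numbers(total, page_size=PAGE_SIZE):
    """Return the 1-based page numbers needed to cover `total` articles."""
    if total <= 0:
        return []
    return list(range(1, math.ceil(total / page_size) + 1))


print(page_numbers(95))  # 95 articles need three pages: 40 + 40 + 15
print(page_numbers(40))  # exactly one full page
```

Rounding up with math.ceil matters for the last, partially filled page: integer division alone would drop it.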

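Fetching the article content

The detail page of each article has a URL of the form https://blog.csdn.net/<username>/article/details/<article id>. We fetch each article's page, parse out the title and body with the BeautifulSoup library, and write them to the database.

Use the same approach as before: open an article, inspect the page, switch to the Network tab, and refresh. Find the request that returns the article data, copy it as cURL (bash), and paste it into "Convert curl commands to code" to generate the code: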

cookies = {
    ...
}

headers = {
    ...
}

params = {
    ...
}
# a is one article id taken from articleId_list
r2 = requests.get(
        'https://blog.csdn.net/<username>/article/details/'+str(a),
        params=params,
        cookies=cookies,
        headers=headers,
)

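Looking at the markup used for the title and the body shows that:

The body is inside the div with id content_views. It can be fetched directly with find(); the return value is still a BeautifulSoup Tag object.

The title can also be fetched with find(), passing the tag name span and filtering on the class attribute tit. (This description does not match the page screenshot; other tags may work just as well.)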


 

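from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup=BeautifulSoup(r2.text,'html.parser')
article_title=soup.find('span',attrs={'class':'tit'}).string
content_div_soup=soup.find('div',id='content_views')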

Writing the articles into the WordPress database

WordPress stores article content in the wp_posts table of the wordpress database:

mysql> use wordpress;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------+
| Tables_in_wordpress   |
+-----------------------+
| wp_commentmeta        |
| wp_comments           |
| wp_links              |
| wp_options            |
| wp_postmeta           |
| wp_posts              |
| wp_term_relationships |
| wp_term_taxonomy      |
| wp_termmeta           |
| wp_terms              |
| wp_usermeta           |
| wp_users              |
+-----------------------+
12 rows in set (0.00 sec)

mysql> desc wp_posts;
+-----------------------+---------------------+------+-----+---------------------+----------------+
| Field                 | Type                | Null | Key | Default             | Extra          |
+-----------------------+---------------------+------+-----+---------------------+----------------+
| ID                    | bigint(20) unsigned | NO   | PRI | NULL                | auto_increment |
| post_author           | bigint(20) unsigned | NO   | MUL | 0                   |                |
| post_date             | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_date_gmt         | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_content          | longtext            | NO   |     | NULL                |                |
| post_title            | text                | NO   |     | NULL                |                |
| post_excerpt          | text                | NO   |     | NULL                |                |
| post_status           | varchar(20)         | NO   |     | publish             |                |
| comment_status        | varchar(20)         | NO   |     | open                |                |
| ping_status           | varchar(20)         | NO   |     | open                |                |
| post_password         | varchar(255)        | NO   |     |                     |                |
| post_name             | varchar(200)        | NO   | MUL |                     |                |
| to_ping               | text                | NO   |     | NULL                |                |
| pinged                | text                | NO   |     | NULL                |                |
| post_modified         | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_modified_gmt     | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_content_filtered | longtext            | NO   |     | NULL                |                |
| post_parent           | bigint(20) unsigned | NO   | MUL | 0                   |                |
| guid                  | varchar(255)        | NO   |     |                     |                |
| menu_order            | int(11)             | NO   |     | 0                   |                |
| post_type             | varchar(20)         | NO   | MUL | post                |                |
| post_mime_type        | varchar(100)        | NO   |     |                     |                |
| comment_count         | bigint(20)          | NO   |     | 0                   |                |
+-----------------------+---------------------+------+-----+---------------------+----------------+
23 rows in set (0.04 sec)

When inserting an article into this table, it is enough to fill in the required fields:

  • post_date
  • post_date_gmt
  • post_modified
  • post_modified_gmt
  • post_title: the article title
  • post_excerpt
  • to_ping
  • pinged
  • post_content_filtered
  • post_content: the article body

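All of these fields accept string values. The first four are timestamps and can be set to the current time; the most important fields are post_title and post_content, which hold the article title and body; the remaining fields can be set to an empty string.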
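
As a minimal sketch of assembling these ten values, the helper below returns a statement with %s placeholders together with a matching tuple. build_insert and COLUMNS are hypothetical names chosen for illustration. The pair is meant to be handed to a DB-API cursor (for example pymysql's cur.execute(sql, values)), which then quotes the values itself; this is an alternative to escaping strings by hand, not necessarily how the original script does it.

```python
import datetime

COLUMNS = (
    'post_date', 'post_date_gmt', 'post_modified', 'post_modified_gmt',
    'post_title', 'post_excerpt', 'to_ping', 'pinged',
    'post_content_filtered', 'post_content',
)


def build_insert(title, content, now=None):
    """Return (sql, values) for a wp_posts insert with %s placeholders."""
    now = now or datetime.datetime.now()
    stamp = now.strftime('%Y-%m-%d %H:%M:%S')
    # Four timestamps, the title, four empty strings, then the body
    values = (stamp, stamp, stamp, stamp, title, '', '', '', '', content)
    placeholders = ', '.join(['%s'] * len(COLUMNS))
    sql = f"INSERT INTO wp_posts ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    return sql, values


sql, values = build_insert("It's a title", '<p>body</p>')
print(sql)
print(values)
# With an open cursor (not created here): cur.execute(sql, values)
```

Because the values travel separately from the SQL text, a title containing a quote character does not break the statement.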

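pymysql is used here to write to the database, with escape_string handling special characters such as single and double quotes. If the scraped HTML is written to the database unchanged, the images do not display; prefixing each image link with //images.weserv.nl/?url= makes them show up: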

import datetime
import pymysql
from pymysql.converters import escape_string  # pymysql v1.0.0 and later

con = pymysql.connect(host='12.34.56.78',user='root',port=3306,database='wordpress',password='123456')
cur= con.cursor()

# Prefix CSDN image links with the image proxy so they display in WordPress
content_str=str(content_div_soup).replace('https://i-blog.csdnimg.cn','//images.weserv.nl/?url=https://i-blog.csdnimg.cn')
now=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
# Parentheses let the % formatting continue on the following lines;
# the title is escaped too, since it may contain quotes
sql=('''insert into wp_posts(
        post_date,post_date_gmt,post_modified,post_modified_gmt,post_title,
        post_excerpt,to_ping,pinged,post_content_filtered,post_content) 
        values( 
        '%s','%s','%s','%s','%s',
        '%s','%s','%s','%s','%s')'''
        %(now,now,now,now,escape_string(article_title),
          '','','','',escape_string(content_str)))
cur.execute(sql)
con.commit()
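
The image-link rewrite can be tried on a small HTML string before touching the database. This is a sketch under the assumption that all article images are served from https://i-blog.csdnimg.cn, the host that appears in the scraped pages; proxy_images is a hypothetical helper name. A negative lookbehind skips links that already carry the proxy prefix, so running it twice does not double the prefix.

```python
import re

PROXY_PREFIX = '//images.weserv.nl/?url='
IMAGE_HOST = 'https://i-blog.csdnimg.cn'  # assumed image host

# Match the image host only when it is not already preceded by "url="
_UNPROXIED = re.compile(r'(?<!url=)' + re.escape(IMAGE_HOST))


def proxy_images(html):
    """Prefix every unproxied image URL with the image proxy."""
    return _UNPROXIED.sub(lambda m: PROXY_PREFIX + IMAGE_HOST, html)


sample = '<img src="https://i-blog.csdnimg.cn/a.png"/>'
once = proxy_images(sample)
print(once)
print(proxy_images(once) == once)  # a second pass leaves the link unchanged
```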

After the insert, refresh the blog home page and the new articles appear.

Full code

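from bs4 import BeautifulSoup
import datetime,re,requests,math,pymysql
from tqdm import tqdm
# pymysql v1.0.0 and later
from pymysql.converters import escape_string
# pymysql v0.10.1 and earlier
#from pymysql import escape_string

# Get all article ids from the home page; only public articles are available here.
# The content management page yields only one page of 20 articles: switching pages changes the request headers, so no data comes back.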
cookies = {
    ...,
}

headers = {
    ...,
}

params = {
    ...,
}

r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)
articleId_list=[d['articleId'] for d in r1.json()['data']['articleList']]
article_count=r1.json()['data']['total']
for p in range(2,math.ceil(article_count/40)+1):
    params = {
    'userName': '...',
    'page': str(p),
    }
    r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)
    page_articleId_list=[d['articleId'] for d in r1.json()['data']['articleList']]
    if page_articleId_list:
        articleId_list.extend(page_articleId_list)
print(len(articleId_list))


# Write the articles into the WordPress database
con = pymysql.connect(host='12.34.56.78',user='Admin',port=3306,database='wordpress',password='123456')
cur= con.cursor()

cookies = {
    ...,
}

headers = {
    ...,
}

params = {
    ...,
}
for a in tqdm(articleId_list[:]):
    r2 = requests.get(
        'https://blog.csdn.net/<username>/article/details/'+str(a),  # replace <username>
        params=params,
        cookies=cookies,
        headers=headers,
    )
    soup=BeautifulSoup(r2.text,'html.parser')
    article_title=soup.find('span',attrs={'class':'tit'}).string
    content_div_soup=soup.find('div',id='content_views')
    # Prefix CSDN image links with the image proxy so they display in WordPress
    content_str=str(content_div_soup).replace('https://i-blog.csdnimg.cn','//images.weserv.nl/?url=https://i-blog.csdnimg.cn')
    now=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # Parentheses let the % formatting continue on the following lines
    sql=('''insert into wp_posts(
        post_date,post_date_gmt,post_modified,post_modified_gmt,post_title,
        post_excerpt,to_ping,pinged,post_content_filtered,post_content) 
        values( 
        '%s','%s','%s','%s','%s',
        '%s','%s','%s','%s','%s')'''
        %(now,now,now,now,escape_string(article_title),
          '','','','',escape_string(content_str)))
    cur.execute(sql)
    con.commit()
print('Finished.')

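Improvements

The steps above do sync the article content, but when such an article is edited in WordPress everything sits in a single block and can only be edited with the classic editor, which is quite different from an article written by hand in WordPress.

Looking at the source of an article written by hand in WordPress shows that, besides the usual HTML tags, each element is wrapped in comment delimiters before and after, and the blocks are separated by blank lines: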

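Paragraph: the markup goes between <!-- wp:paragraph --> and <!-- /wp:paragraph -->;

Heading: the markup goes between <!-- wp:heading {"level":1} --> and <!-- /wp:heading -->;

Image: the markup goes between <!-- wp:image {"id":18,"sizeSlug":"large","linkDestination":"none"} --> and <!-- /wp:image -->; if several images share one row, they are additionally wrapped in <!-- wp:gallery {"linkTo":"none"} --> and <!-- /wp:gallery -->;

List item: the markup goes between <!-- wp:list-item --> and <!-- /wp:list-item -->.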

<!-- wp:heading {"level":1} -->
<h1 class="wp-block-heading">Level-1 heading 1</h1>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>This is the content under the level-1 heading.</p>
<!-- /wp:paragraph -->

<!-- wp:heading -->
<h2 class="wp-block-heading">Level-2 heading</h2>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>Images and animated GIFs</p>
<!-- /wp:paragraph -->

<!-- wp:gallery {"linkTo":"none"} -->
<figure class="wp-block-gallery has-nested-images columns-default is-cropped"><!-- wp:image {"id":18,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="http://noting.online/wp-content/uploads/2025/07/Snipaste_2025-07-31_02-03-07.png" alt="" class="wp-image-18"/></figure>
<!-- /wp:image -->

<!-- wp:image {"id":20,"width":"206px","height":"auto","sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large is-resized"><img src="http://noting.online/wp-content/uploads/2025/07/视频样本.gif" alt="" class="wp-image-20" style="width:206px;height:auto"/></figure>
<!-- /wp:image --></figure>
<!-- /wp:gallery -->

<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Level-3 heading</h3>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>A hyperlink</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p><a href="https://www.runoob.com/linux/linux-file-attr-permission.html">https://www.runoob.com/linux/linux-file-attr-permission.html</a></p>
<!-- /wp:paragraph -->

<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Level-4 heading</h4>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>A piece of code</p>
<!-- /wp:paragraph -->

<!-- wp:wb/wp-vk {"vkMark":""} -->

<!-- /wp:wb/wp-vk -->

<!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:code -->
<pre class="wp-block-code"><code>print('Hello World')</code></pre>
<!-- /wp:code --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group -->

<!-- wp:heading {"level":5} -->
<h5 class="wp-block-heading">Level-5 heading</h5>
<!-- /wp:heading -->

<!-- wp:html -->
[]
<!-- /wp:html -->

<!-- wp:paragraph -->
<p>A table</p>
<!-- /wp:paragraph -->

<!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"grid"}} -->
<div class="wp-block-group"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"grid"}} -->
<div class="wp-block-group"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:table -->
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td></td><td>Column 1</td><td>Column 2</td></tr><tr><td>Row 1</td><td></td><td></td></tr><tr><td>Row 2</td><td></td><td></td></tr></tbody></table></figure>
<!-- /wp:table --></div>
<!-- /wp:group --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns -->

<!-- wp:heading {"level":6} -->
<h6 class="wp-block-heading">Level-6 heading</h6>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>Unordered list</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul class="wp-block-list"><!-- wp:list-item -->
<li>Unordered list item 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>Unordered list item 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>Unordered list item 3</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>Ordered list</p>
<!-- /wp:paragraph -->

<!-- wp:list {"ordered":true} -->
<ol class="wp-block-list"><!-- wp:list-item -->
<li>Ordered list item 1</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>Ordered list item 2</li>
<!-- /wp:list-item -->

<!-- wp:list-item -->
<li>Ordered list item 3</li>
<!-- /wp:list-item --></ol>
<!-- /wp:list -->

<!-- wp:heading {"level":1} -->
<h1 class="wp-block-heading">Level-1 heading 2</h1>
<!-- /wp:heading -->

<!-- wp:paragraph -->
<p>Body text under level-1 heading 2</p>
<!-- /wp:paragraph -->

<!-- wp:wb/wp-vk {"vkMark":""} -->

<!-- /wp:wb/wp-vk -->

<!-- wp:paragraph -->
<p></p>
<!-- /wp:paragraph -->

<!-- wp:categories /-->

<!-- wp:paragraph -->
<p></p>
<!-- /wp:paragraph -->

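So when inserting content into the database from Python, getting the same result as a hand-written article requires converting the scraped HTML into WordPress's block format.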
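
The delimiter pattern seen in the sample source can be sketched as a small helper: an opening comment with the block name and optional JSON attributes, the HTML itself, then the closing comment. wp_block is a hypothetical name used only for illustration, and the output format is inferred from the sample above rather than from a WordPress specification.

```python
import json


def wp_block(name, inner_html, attrs=None):
    """Wrap HTML in the opening and closing comment delimiters of a block."""
    # Compact separators reproduce the {"level":1} style seen in the sample
    attr_text = ' ' + json.dumps(attrs, separators=(',', ':')) if attrs else ''
    return f'<!-- wp:{name}{attr_text} -->\n{inner_html}\n<!-- /wp:{name} -->'


blocks = [
    wp_block('heading', '<h1 class="wp-block-heading">Title</h1>', {'level': 1}),
    wp_block('paragraph', '<p>Some text.</p>'),
]
# Blocks are separated by a blank line, as in the hand-written article
print('\n\n'.join(blocks))
```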

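Only paragraphs and headings are handled here as an example: regular-expression substitutions add the WordPress comment delimiters before and after the corresponding tags, and the rest of the code changes only slightly: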

for a in tqdm(articleId_list[:]):
    r2 = requests.get(
        'https://blog.csdn.net/<username>/article/details/'+str(a),  # replace <username>
        params=params,
        cookies=cookies,
        headers=headers,
    )
    soup=BeautifulSoup(r2.text,'html.parser')
    article_title=soup.find('span',attrs={'class':'tit'}).string
    content_div_soup=soup.find('div',id='content_views')
    # Join the children of the second child node of the content div
    s=''.join(list(map(str,content_div_soup.contents[1].contents)))
    s=s.replace('https://i-blog.csdnimg.cn','//images.weserv.nl/?url=https://i-blog.csdnimg.cn')
    # \1 puts the matched tag back next to the WordPress block comment
    s=re.sub("(<h[1-6]>)",r"<!-- wp:heading -->\1",s)
    s=re.sub("(</h[1-6]>)",r"\1<!-- /wp:heading -->",s)
    s=re.sub("(<p>)",r"<!-- wp:paragraph -->\1",s)
    s=re.sub("(</p>)",r"\1<!-- /wp:paragraph -->",s)
    now=datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # Parentheses let the % formatting continue on the following lines
    sql=('''insert into wp_posts(
        post_date,post_date_gmt,post_modified,post_modified_gmt,post_title,
        post_excerpt,to_ping,pinged,post_content_filtered,post_content) 
        values( 
        '%s','%s','%s','%s','%s',
        '%s','%s','%s','%s','%s')'''
        %(now,now,now,now,escape_string(article_title),
          '','','','',escape_string(s)))
    cur.execute(sql)
    con.commit()
print('Finished.')
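
One possible refinement, sketched here as an assumption rather than as part of the original script: match each whole heading element, so that tags with attributes are handled and the level can be written into the opening comment. In the sample source, level-2 headings carry no level attribute while the other levels do, and the sketch follows that pattern. to_blocks and wrap_heading are hypothetical names.

```python
import re

# A heading or paragraph element, with or without attributes on the opening tag
HEADING = re.compile(r'<h([1-6])\b[^>]*>.*?</h\1>', re.DOTALL)
PARAGRAPH = re.compile(r'<p\b[^>]*>.*?</p>', re.DOTALL)


def wrap_heading(match):
    """Wrap one heading element, adding the level unless it is 2."""
    level = match.group(1)
    if level == '2':
        opening = '<!-- wp:heading -->'
    else:
        opening = '<!-- wp:heading {"level":%s} -->' % level
    return f'{opening}\n{match.group(0)}\n<!-- /wp:heading -->'


def to_blocks(html):
    """Wrap headings and paragraphs in WordPress block comments."""
    html = HEADING.sub(wrap_heading, html)
    return PARAGRAPH.sub(
        lambda m: f'<!-- wp:paragraph -->\n{m.group(0)}\n<!-- /wp:paragraph -->',
        html,
    )


print(to_blocks('<h1>A</h1><p>b</p><h2 id="x">C</h2>'))
```

Regular expressions are fragile on nested or malformed HTML, so this only suits simple, flat article bodies; walking the BeautifulSoup tree would be more robust.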
