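Syncing CSDN Blog Posts to WordPress
Getting the list of article IDs to sync
Automated scraping of the content management page does not work well: only a single page of data (20 articles) can be fetched, and the request headers change when switching to another page, so the content cannot be retrieved. For that reason this post only scrapes the public article list on the user's homepage, which means the result is not fully complete.
The homepage URL is https://blog.csdn.net/ followed by the username. Open the homepage, inspect the page, switch to the Network tab, and refresh; the returned data is visible in one of the requests:
Copy that request as cURL (bash) and paste it into Convert curl commands to code to generate the code automatically.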
import requests

cookies = {
    ...
}
headers = {
    ...
}
params = {
    'userName': '...',
    'page': '1',
}
r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)
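The articles may be spread across several pages, and the page is selected through the page entry in params. Read the total article count, work out how many pages there are (40 articles per page), then loop over them: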
import math

articleId_list = [d['articleId'] for d in r1.json()['data']['articleList']]
article_count = r1.json()['data']['total']
# Page 1 was fetched above; request the remaining pages (40 articles each).
for p in range(2, math.ceil(article_count / 40) + 1):
    params = {
        'userName': '<username>',
        'page': str(p),
    }
    r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)
    page_articleId_list = [d['articleId'] for d in r1.json()['data']['articleList']]
    if page_articleId_list:
        articleId_list.extend(page_articleId_list)
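The page arithmetic can be checked in isolation before sending any requests. The following is a minimal sketch; the helper name remaining_pages and its page_size parameter are illustrative assumptions (40 is the page size stated in this post), not part of any CSDN API.

```python
import math


def remaining_pages(total, page_size=40):
    """Return the page numbers still to fetch after page 1 (hypothetical helper)."""
    if total <= 0:
        return []
    last_page = math.ceil(total / page_size)
    # Page 1 is assumed to be fetched already, so start at 2.
    return list(range(2, last_page + 1))


print(remaining_pages(85))  # 85 articles: pages 2 and 3 remain
print(remaining_pages(40))  # exactly one full page: nothing left to fetch
```

Keeping the calculation in one place makes it easy to adjust if the page size returned by the endpoint turns out to differ.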
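Fetching article content
The detail page of each article has the URL format https://blog.csdn.net/<username>/article/details/<article id>. Here we fetch each article, parse out the title and body with the BeautifulSoup library, and write them to the database.
Use the same approach as before: open an article, inspect the page, switch to the Network tab, and refresh; the returned data is visible in one of the requests. Copy that request as cURL (bash) and paste it into Convert curl commands to code to generate the code automatically: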
cookies = {
    ...
}
headers = {
    ...
}
params = {
    ...
}
title_content_dict = {}
# a is one article ID taken from articleId_list
r2 = requests.get(
    'https://blog.csdn.net/<username>/article/details/' + str(a),
    params=params,
    cookies=cookies,
    headers=headers,
)
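Inspecting the tags used for the article title and body shows that:
the body sits in the div whose id is content_views and can be fetched directly with find(); the return value is a Tag object that supports the same search methods;
the title can also be fetched with find(), passing the tag name span and the class attribute value tit as the filter (this differs from what the screenshot shows, so other tags may work as well);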

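from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(r2.text, 'html.parser')
article_title = soup.find('span', attrs={'class': 'tit'}).string
content_div_soup = soup.find('div', id='content_views')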
Writing articles to the WordPress database
WordPress stores post content in the wp_posts table of the wordpress database:
mysql> use wordpress;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------+
| Tables_in_wordpress   |
+-----------------------+
| wp_commentmeta        |
| wp_comments           |
| wp_links              |
| wp_options            |
| wp_postmeta           |
| wp_posts              |
| wp_term_relationships |
| wp_term_taxonomy      |
| wp_termmeta           |
| wp_terms              |
| wp_usermeta           |
| wp_users              |
+-----------------------+
12 rows in set (0.00 sec)

mysql> desc wp_posts;
+-----------------------+---------------------+------+-----+---------------------+----------------+
| Field                 | Type                | Null | Key | Default             | Extra          |
+-----------------------+---------------------+------+-----+---------------------+----------------+
| ID                    | bigint(20) unsigned | NO   | PRI | NULL                | auto_increment |
| post_author           | bigint(20) unsigned | NO   | MUL | 0                   |                |
| post_date             | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_date_gmt         | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_content          | longtext            | NO   |     | NULL                |                |
| post_title            | text                | NO   |     | NULL                |                |
| post_excerpt          | text                | NO   |     | NULL                |                |
| post_status           | varchar(20)         | NO   |     | publish             |                |
| comment_status        | varchar(20)         | NO   |     | open                |                |
| ping_status           | varchar(20)         | NO   |     | open                |                |
| post_password         | varchar(255)        | NO   |     |                     |                |
| post_name             | varchar(200)        | NO   | MUL |                     |                |
| to_ping               | text                | NO   |     | NULL                |                |
| pinged                | text                | NO   |     | NULL                |                |
| post_modified         | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_modified_gmt     | datetime            | NO   |     | 0000-00-00 00:00:00 |                |
| post_content_filtered | longtext            | NO   |     | NULL                |                |
| post_parent           | bigint(20) unsigned | NO   | MUL | 0                   |                |
| guid                  | varchar(255)        | NO   |     |                     |                |
| menu_order            | int(11)             | NO   |     | 0                   |                |
| post_type             | varchar(20)         | NO   | MUL | post                |                |
| post_mime_type        | varchar(100)        | NO   |     |                     |                |
| comment_count         | bigint(20)          | NO   |     | 0                   |                |
+-----------------------+---------------------+------+-----+---------------------+----------------+
23 rows in set (0.04 sec)
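When inserting an article into the table, only the required fields need to be filled in:
- post_date
- post_date_gmt
- post_modified
- post_modified_gmt
- post_title: the article title
- post_excerpt
- to_ping
- pinged
- post_content_filtered
- post_content: the article body
All of the fields above accept string values. The first four are timestamps and can be set to the current time; the most important fields are post_title and post_content, which hold the article title and body; the remaining fields can simply be set to empty strings.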
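One detail about the four time fields: WordPress conventionally keeps post_date and post_modified in the site's local time and the two *_gmt columns in UTC. Using the same current time for all four, as this post does, is enough for the post to show up. If the distinction matters, it could be sketched as follows; the helper name wp_timestamps and the utc_offset_hours parameter are assumptions made for illustration.

```python
import datetime


def wp_timestamps(now_utc, utc_offset_hours):
    """Format local and GMT timestamps for wp_posts (hypothetical helper)."""
    fmt = '%Y-%m-%d %H:%M:%S'
    local = now_utc + datetime.timedelta(hours=utc_offset_hours)
    return {
        'post_date': local.strftime(fmt),
        'post_date_gmt': now_utc.strftime(fmt),
        'post_modified': local.strftime(fmt),
        'post_modified_gmt': now_utc.strftime(fmt),
    }


# Example with a fixed instant and a UTC+8 site timezone (assumed values).
stamps = wp_timestamps(datetime.datetime(2025, 7, 31, 2, 3, 7), 8)
print(stamps['post_date'])      # 2025-07-31 10:03:07
print(stamps['post_date_gmt'])  # 2025-07-31 02:03:07
```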
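pymysql is used here to write to the database, with escape_string handling special characters such as single and double quotes. If the source HTML is written to the database as-is, the images do not display; prefixing the image links with //images.weserv.nl/?url= makes them display: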
import datetime
import pymysql
from pymysql.converters import escape_string  # pymysql v1.0.0 and later

con = pymysql.connect(host='12.34.56.78', user='root', port=3306, database='wordpress', password='123456')
cur = con.cursor()
# Route CSDN image URLs through the images.weserv.nl proxy so they display.
content_str = str(content_div_soup).replace('https://i-blog.csdnimg.cn', '//images.weserv.nl/?url=https://i-blog.csdnimg.cn')
now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
# Escape both the title and the body, since either may contain quotes.
sql = '''insert into wp_posts(
    post_date,post_date_gmt,post_modified,post_modified_gmt,post_title,
    post_excerpt,to_ping,pinged,post_content_filtered,post_content)
    values(
    '%s','%s','%s','%s','%s',
    '%s','%s','%s','%s','%s')''' % (
    now, now, now, now, escape_string(article_title),
    '', '', '', '', escape_string(content_str))
cur.execute(sql)
con.commit()
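A plain string replacement adds the proxy prefix again if the script is rerun on content that was already rewritten. The sketch below shows one way to make the rewrite idempotent with a negative lookbehind; the function name proxy_images is a hypothetical helper, and the pattern assumes the image host i-blog.csdnimg.cn used in this post.

```python
import re

PROXY = '//images.weserv.nl/?url='
# Match CSDN image URLs that are not already preceded by the proxy's "?url=".
CSDN_IMG = re.compile(r'(?<!\?url=)https://i-blog\.csdnimg\.cn')


def proxy_images(html):
    """Prefix CSDN image URLs with the weserv proxy, skipping ones already prefixed."""
    return CSDN_IMG.sub(lambda m: PROXY + m.group(0), html)


once = proxy_images('<img src="https://i-blog.csdnimg.cn/a.png">')
print(once)
print(proxy_images(once) == once)  # True: a second pass changes nothing
```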
After the insert, refresh the blog homepage and the newly created post appears.
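As an alternative to formatting escaped values into the SQL text, pymysql's cursor.execute accepts the values as a second argument and does the quoting itself. The following is a minimal sketch of preparing such a statement; build_insert and INSERT_SQL are hypothetical names, and no database connection is opened here.

```python
import datetime

INSERT_SQL = (
    'insert into wp_posts('
    'post_date,post_date_gmt,post_modified,post_modified_gmt,post_title,'
    'post_excerpt,to_ping,pinged,post_content_filtered,post_content) '
    'values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
)


def build_insert(title, content, now=None):
    """Return (sql, args) for a wp_posts insert; values stay out of the SQL text."""
    now = now or datetime.datetime.now()
    stamp = now.strftime('%Y-%m-%d %H:%M:%S')
    args = (stamp, stamp, stamp, stamp, title, '', '', '', '', content)
    return INSERT_SQL, args


sql, args = build_insert("It's a title", '<p>"body"</p>')
# With an open pymysql cursor this would be run as: cur.execute(sql, args)
print(sql.count('%s'), len(args))  # 10 10
```

With this form the title and body need no manual escaping, because the driver quotes each value when it fills in the placeholders.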
Full code
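import datetime
import math
import re

import pymysql
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
# pymysql v1.0.0 and later
from pymysql.converters import escape_string
# pymysql v0.10.1 and earlier
#from pymysql import escape_string

# Get all article IDs from the homepage; only public articles are available here.
# The content management page returns only the 20 articles of one page, and the request headers change when switching pages, so the data cannot be fetched.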
cookies = {
    ...,
}
headers = {
    ...,
}
params = {
    ...,
}
r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)
articleId_list = [d['articleId'] for d in r1.json()['data']['articleList']]
article_count = r1.json()['data']['total']
for p in range(2, math.ceil(article_count / 40) + 1):
    params = {
        'userName': '...',
        'page': str(p),
    }
    r1 = requests.get('https://blog.csdn.net/phoenix/web/article/list', params=params, cookies=cookies, headers=headers)
    page_articleId_list = [d['articleId'] for d in r1.json()['data']['articleList']]
    if page_articleId_list:
        articleId_list.extend(page_articleId_list)
print(len(articleId_list))
# Write the articles to the WordPress database
con = pymysql.connect(host='12.34.56.78', user='Admin', port=3306, database='wordpress', password='123456')
cur = con.cursor()
cookies = {
    ...,
}
headers = {
    ...,
}
params = {
    ...,
}
for a in tqdm(articleId_list[:]):
    r2 = requests.get(
        'https://blog.csdn.net/<username>/article/details/' + str(a),
        params=params,
        cookies=cookies,
        headers=headers,
    )
    soup = BeautifulSoup(r2.text, 'html.parser')
    article_title = soup.find('span', attrs={'class': 'tit'}).string
    content_div_soup = soup.find('div', id='content_views')
    # Route CSDN image URLs through the images.weserv.nl proxy so they display.
    content_str = str(content_div_soup).replace('https://i-blog.csdnimg.cn', '//images.weserv.nl/?url=https://i-blog.csdnimg.cn')
    now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    # Escape both the title and the body, since either may contain quotes.
    sql = '''insert into wp_posts(
        post_date,post_date_gmt,post_modified,post_modified_gmt,post_title,
        post_excerpt,to_ping,pinged,post_content_filtered,post_content)
        values(
        '%s','%s','%s','%s','%s',
        '%s','%s','%s','%s','%s')''' % (
        now, now, now, now, escape_string(article_title),
        '', '', '', '', escape_string(content_str))
    cur.execute(sql)
    con.commit()
print('Finished.')
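Improvements
With the steps above the article content does get synced, but when editing it in WordPress everything sits in a single block that can only be edited with the Classic editor, which is quite different from a post written by hand in WordPress.
Looking at the source of a post written directly in WordPress shows that, in addition to the regular tags, each element is wrapped in comment markers before and after, with blank lines between blocks: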
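Paragraphs: the markup sits between <!-- wp:paragraph --> and <!-- /wp:paragraph -->;
Headings: the markup sits between <!-- wp:heading {"level":1} --> and <!-- /wp:heading -->;
Images: the markup sits between <!-- wp:image {"id":18,"sizeSlug":"large","linkDestination":"none"} --> and <!-- /wp:image -->; when several images share one row, they are additionally wrapped in <!-- wp:gallery {"linkTo":"none"} --> and <!-- /wp:gallery -->;
List items: the markup sits between <!-- wp:list-item --> and <!-- /wp:list-item -->;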
<!-- wp:heading {"level":1} -->
<h1 class="wp-block-heading">Level-1 heading 1</h1>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>This is the content under the level-1 heading.</p>
<!-- /wp:paragraph -->
<!-- wp:heading -->
<h2 class="wp-block-heading">Level-2 heading</h2>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Images and animated GIFs</p>
<!-- /wp:paragraph -->
<!-- wp:gallery {"linkTo":"none"} -->
<figure class="wp-block-gallery has-nested-images columns-default is-cropped"><!-- wp:image {"id":18,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large"><img src="http://noting.online/wp-content/uploads/2025/07/Snipaste_2025-07-31_02-03-07.png" alt="" class="wp-image-18"/></figure>
<!-- /wp:image -->
<!-- wp:image {"id":20,"width":"206px","height":"auto","sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large is-resized"><img src="http://noting.online/wp-content/uploads/2025/07/视频样本.gif" alt="" class="wp-image-20" style="width:206px;height:auto"/></figure>
<!-- /wp:image --></figure>
<!-- /wp:gallery -->
<!-- wp:heading {"level":3} -->
<h3 class="wp-block-heading">Level-3 heading</h3>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>A hyperlink</p>
<!-- /wp:paragraph -->
<!-- wp:paragraph -->
<p><a href="https://www.runoob.com/linux/linux-file-attr-permission.html">https://www.runoob.com/linux/linux-file-attr-permission.html</a></p>
<!-- /wp:paragraph -->
<!-- wp:heading {"level":4} -->
<h4 class="wp-block-heading">Level-4 heading</h4>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>A piece of code</p>
<!-- /wp:paragraph -->
<!-- wp:wb/wp-vk {"vkMark":""} -->
<!-- /wp:wb/wp-vk -->
<!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:code -->
<pre class="wp-block-code"><code>print('Hello World')</code></pre>
<!-- /wp:code --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group -->
<!-- wp:heading {"level":5} -->
<h5 class="wp-block-heading">Level-5 heading</h5>
<!-- /wp:heading -->
<!-- wp:html -->
[]
<!-- /wp:html -->
<!-- wp:paragraph -->
<p>A table</p>
<!-- /wp:paragraph -->
<!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"grid"}} -->
<div class="wp-block-group"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:columns -->
<div class="wp-block-columns"><!-- wp:column {"width":"100%"} -->
<div class="wp-block-column" style="flex-basis:100%"><!-- wp:group {"layout":{"type":"grid"}} -->
<div class="wp-block-group"><!-- wp:group {"layout":{"type":"constrained"}} -->
<div class="wp-block-group"><!-- wp:table -->
<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><td></td><td>Column 1</td><td>Column 2</td></tr><tr><td>Row 1</td><td></td><td></td></tr><tr><td>Row 2</td><td></td><td></td></tr></tbody></table></figure>
<!-- /wp:table --></div>
<!-- /wp:group --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns --></div>
<!-- /wp:group --></div>
<!-- /wp:group --></div>
<!-- /wp:column --></div>
<!-- /wp:columns -->
<!-- wp:heading {"level":6} -->
<h6 class="wp-block-heading">Level-6 heading</h6>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Unordered list</p>
<!-- /wp:paragraph -->
<!-- wp:list -->
<ul class="wp-block-list"><!-- wp:list-item -->
<li>Unordered list item 1</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Unordered list item 2</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Unordered list item 3</li>
<!-- /wp:list-item --></ul>
<!-- /wp:list -->
<!-- wp:paragraph -->
<p>Ordered list</p>
<!-- /wp:paragraph -->
<!-- wp:list {"ordered":true} -->
<ol class="wp-block-list"><!-- wp:list-item -->
<li>Ordered list item 1</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Ordered list item 2</li>
<!-- /wp:list-item -->
<!-- wp:list-item -->
<li>Ordered list item 3</li>
<!-- /wp:list-item --></ol>
<!-- /wp:list -->
<!-- wp:heading {"level":1} -->
<h1 class="wp-block-heading">Level-1 heading 2</h1>
<!-- /wp:heading -->
<!-- wp:paragraph -->
<p>Body text under level-1 heading 2</p>
<!-- /wp:paragraph -->
<!-- wp:wb/wp-vk {"vkMark":""} -->
<!-- /wp:wb/wp-vk -->
<!-- wp:paragraph -->
<p></p>
<!-- /wp:paragraph -->
<!-- wp:categories /-->
<!-- wp:paragraph -->
<p></p>
<!-- /wp:paragraph -->
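The list markup in the source above nests one wp:list-item block per entry inside a wp:list block. A minimal sketch of generating that shape from plain strings might look like this; list_block is a hypothetical helper, and the exact whitespace WordPress expects between blocks is an assumption based on the sample.

```python
import html


def list_block(items, ordered=False):
    """Build a wp:list block with one wp:list-item per entry (hypothetical helper)."""
    tag = 'ol' if ordered else 'ul'
    opener = '<!-- wp:list {"ordered":true} -->' if ordered else '<!-- wp:list -->'
    item_blocks = [
        '<!-- wp:list-item -->\n<li>%s</li>\n<!-- /wp:list-item -->' % html.escape(text)
        for text in items
    ]
    # Item blocks are separated by a blank line, as blocks are in the editor's output.
    return '%s\n<%s class="wp-block-list">%s</%s>\n<!-- /wp:list -->' % (
        opener, tag, '\n\n'.join(item_blocks), tag)


print(list_block(['Ordered list item 1', 'Ordered list item 2'], ordered=True))
```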
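So, to get the same result as a hand-written post when inserting content into the database with Python, the fetched HTML has to be converted to the WordPress format first.
Only paragraphs and headings are handled here as an example: the WordPress comment markers are added before and after the tags with regular-expression substitutions, and the rest of the code is only slightly modified: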
for a in tqdm(articleId_list[:]):
    r2 = requests.get(
        'https://blog.csdn.net/<username>/article/details/' + str(a),
        params=params,
        cookies=cookies,
        headers=headers,
    )
    soup = BeautifulSoup(r2.text, 'html.parser')
    article_title = soup.find('span', attrs={'class': 'tit'}).string
    content_div_soup = soup.find('div', id='content_views')
    # Join the children of the content wrapper into one HTML string.
    s = ''.join(list(map(str, content_div_soup.contents[1].contents)))
    s = s.replace('https://i-blog.csdnimg.cn', '//images.weserv.nl/?url=https://i-blog.csdnimg.cn')
    # Wrap headings and paragraphs in WordPress block comments; \1 is the matched tag.
    s = re.sub("(<h[1-6]>)", r"<!-- wp:heading -->\1", s)
    s = re.sub("(</h[1-6]>)", r"\1<!-- /wp:heading -->", s)
    s = re.sub("(<p>)", r"<!-- wp:paragraph -->\1", s)
    s = re.sub("(</p>)", r"\1<!-- /wp:paragraph -->", s)
    now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    sql = '''insert into wp_posts(
        post_date,post_date_gmt,post_modified,post_modified_gmt,post_title,
        post_excerpt,to_ping,pinged,post_content_filtered,post_content)
        values(
        '%s','%s','%s','%s','%s',
        '%s','%s','%s','%s','%s')''' % (
        now, now, now, now, escape_string(article_title),
        '', '', '', '', escape_string(s))
    cur.execute(sql)
    con.commit()
print('Finished.')
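In the hand-written sample, headings other than level 2 carry a {"level":N} attribute in the opening comment and a wp-block-heading class on the tag, and tags fetched from CSDN may carry attributes that a literal <h1> or <p> pattern does not match. A sketch of a conversion that accounts for both is shown below; to_blocks and its helper functions are hypothetical, the original tag attributes are dropped, and whether the block editor accepts the result without a validation notice has not been verified here.

```python
import json
import re

HEADING = re.compile(r'<h([1-6])[^>]*>(.*?)</h\1>', re.DOTALL)
PARAGRAPH = re.compile(r'<p(?:\s[^>]*)?>(.*?)</p>', re.DOTALL)


def _heading(m):
    level, inner = int(m.group(1)), m.group(2)
    # The sample source omits the attribute for level 2 and writes it otherwise.
    attrs = '' if level == 2 else json.dumps({'level': level}, separators=(',', ':')) + ' '
    return '<!-- wp:heading %s-->\n<h%d class="wp-block-heading">%s</h%d>\n<!-- /wp:heading -->' % (
        attrs, level, inner, level)


def _paragraph(m):
    return '<!-- wp:paragraph -->\n<p>%s</p>\n<!-- /wp:paragraph -->' % m.group(1)


def to_blocks(html):
    """Wrap headings and paragraphs in block comments (attributes are discarded)."""
    html = HEADING.sub(_heading, html)
    return PARAGRAPH.sub(_paragraph, html)


print(to_blocks('<h3 id="intro">Title</h3><p>Some text</p>'))
```

Regular expressions are kept here to stay close to the approach in this post; nested or malformed markup would be better handled by walking the parsed tree.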