python – 王春伟的技术博客

Python 3.11 正式版发布，比 3.10 快 10-60%，官方：这或许是最好的版本

终于，Python 3.11 正式版发布了！

2020 年 1 月 1 日，Python 官方结束了对 Python 2 的维护，这意味着 Python 2 已完全退休，进入了 Python 3 时代。打从进入 3 版本以来，Python 官方已经发布了众多修改分支，现在来到了最新的版本 Python 3.11。

其实研究界有个不公开的秘密，那就是 Python 运行速度并不快但容易上手，因此使用人数超级多，在众多最受欢迎语言榜单中 Python 多次位列第一。很多开发者都期待这门语言的性能有所提升，还有人畅想 Python 4 会不会在某个不经意的时刻到来，有这种想法的人可以放一放了，Python 之父 Van Rossum 都说了，Python 4.0 可能不会来了。

Van Rossum 曾表示：「我和 Python 核心开发团队的成员对 Python 4.0 没什么想法，提不起兴趣，估计至少会一直编号到 3.33。Python 的加速是渐进式的，3.11 版本会有新的速度提升，预计会比 3.10 快得多。」

正如 Van Rossum 所说，根据官方资料显示最新发布的 Python 3.11 比 Python 3.10 快 10-60%，对用户更友好。这一版本历经 17 个月的开发，现在公开可用。

Python 3.11 的具体改进主要表现在：更详实的 Error Tracebacks、更快的代码执行、更好的异步任务语法、改进类型变量、支持 TOML 配置解析以及一些其他非常酷的功能（包括快速启动、Zero-Cost 异常处理、异常组等）。

Python 指导委员会成员和核心开发者、Python3.10/3.11 发布管理者 Pablo Galindo Salgado 表示，为了使 3.11 成为最好的 Python 版本，我们付出了很多努力。

Python 3.11 新特性

Error Tracebacks

Python 这门编程语言对初学者非常友好，它具有易于理解的语法和强大的数据结构。但对于刚刚接触 Python 的人来说却存在一个难题，即如何解释当 Python 遇到错误时显示的 traceback。

Python 3.11 将 Decorative annotation 添加到 tracebacks 中，以帮助用户更快地解释错误消息。想要获得这种功能，可以将以下代码添加到 inverse.py 文件中。

举例来说，你可以使用 inverse() 来计算一个数的倒数。因为 0 没有倒数，所以在运行下列代码时会抛出一个错误。

注意嵌入在 traceback 中的 ^ 和~ 符号，它们指向导致错误的代码。与此前的 tracebacks 一样，你应该从底层开始，然后逐步向上。这种操作对发现错误非常有用，但如果代码过于复杂，带注释的 tracebacks 会更好。

更快的代码执行

Python 以速度慢著称，例如在 Python 中，常规循环比 C 中的类似循环慢几个数量级。

Python 官方正在着手改进这一缺陷。2020 年秋，Mark Shannon 提出了关于 Python 的几个性能改进。这个提议被称为香农计划 (Shannon Plan)，他们希望通过几个版本的更新将 Python 的速度提高 5 倍。不久之后微软正式加入该计划，该公司正在支持包括 Mark Shannon、Guido van Rossum 在内的开发人员，致力于「Faster CPython」项目的研究。

「Faster CPython」项目中的一个重要提案是 PEP 659，在此基础上，Python 3.11 有了许多改进。

PEP 659 描述了一种「specializing adaptive interpreter」。主要思想是通过优化经常执行的操作来加快代码运行速度，这类似于 JIT（just-in-time）编译。只是它不影响编译，相反，Python 的字节码是动态调整或可更改的。

研究人员在字节码生成中添加了一个名为「quickening」的新步骤，从而可以在运行时优化指令，并将它们替换为 adaptive 指令。

一旦函数被调用了一定次数，quickening 指令就会启动。在 CPython 3.11 中，八次调用之后就会启动 quickening。你可以通过调用 dis() 并设置 adaptive 参数来观察解释器如何适应字节码。

在基准测试中，CPython 3.11 比 CPython 3.10 平均快 25%。Faster CPython 项目是一个正在进行的项目，已经有几个优化计划在 2023 年 10 月与 Python 3.12 一起发布。你可以在 GitHub 上关注该项目。

项目地址：https://github.com/faster-cpython/ideas

更好的异步任务语法

Python 中对异步编程的支持已经发展了很长时间。Python 2 时代添加了生成器，asyncio 库最初是在 Python 3.4 中添加的，而 async 和 await 关键字是在 Python 3.5 中添加的。在 Python 3.11 中，你可以使用任务组（task groups），它为运行和监视异步任务提供了更简洁的语法。

改进的类型变量

Python 是一种动态类型语言，但它通过可选的类型提示支持静态类型。Python 静态类型系统的基础在 2015 年的 PEP 484 中定义。自 Python 3.5 以来，每个 Python 版本都引入了几个与类型相关的新提案。

Python 3.11 发布了 5 个与类型相关的 PEP，创下新高：

PEP 646: 可变泛型
PEP 655: 根据需要或可能丢失的情况标记单个 TypedDict 项
PEP 673: Self 类型
PEP 675: 任意文字字符串类型
PEP 681: 数据类转换

支持 TOML 配置解析

TOML 是 Tom’s Obvious Minimal Language 的缩写。这是一种在过去十年中流行起来的配置文件格式。在为包和项目指定元数据时，Python 社区已将 TOML 作为首选格式。

虽然 TOML 已被使用多年，但 Python 并没有内置的 TOML 支持。当 tomllib 添加到标准库时，Python 3.11 中的情况发生了变化。这个新模块建立在 toml 第三方库之上，允许解析 TOML 文件。

以下是名为 units.toml 的 TOML 文件示例：

其他功能

除了以上主要更新和改进之外，Python 3.11 还有更多值得探索的功能，比如更快的程序启动速度、对异常的更多改变以及对字符串格式的小幅改进。

更快的程序启动速度

Faster CPython 项目的一大成果是实现了更快的启动时间。当你运行 Python 脚本时，解释器初始化需要一些操作。这就导致即便是最简单的程序也需要几毫秒才能运行。

在很多情况下，与运行代码所需时间相比，启动程序需要的时间可以忽略不计。但是在运行时间较短的脚本中，如典型的命令行应用程序，启动时间可能会显著影响程序性能。比如考虑如下脚本，它受到了经典 cowsay 程序的启发。

在 snakesay.py 中，你从命令行读取一条消息，然后将这条消息打印在带有一条可爱蛇的对话气泡中。你可以让蛇说任何话。这是命令行应用程序的基本示例，它运行得很快，但仍需要几毫秒。这一开销的很大部分发生在 Python 导入模块时。

你可以使用 – X importtime 选项来显示导入模块所用的时间。表中的数字为微秒为单位，最后一列是模块名称的格式。

该示例分别运行在 Python 3.11 和 3.10 上，结果如下图所示，Python 3.11 的导入速度更快，有助于 Python 程序更快地启动。

零成本异常

异常的内部表示在 Python 3.11 中有所不同。异常对象更轻量级，并且异常处理发生了变化。因此只要不触发 except 字句，try … except 块中的开销就越小。

所谓的零成本异常受到了 C++ 和 Java 等其他语言的启发。当你的源代码被编译为字节码时，编译器创建跳转表，由此来实现零成本异常。如果引发异常，查询这些跳转表。如果没有异常，则 try 块中的代码没有运行时开销。

异常组

此前，你了解到了任务组以及它们如何同时处理多个错误。这都要归功于一个被称为异常组的新功能。

我们可以这样考虑异常组，它们是包装了其他几种常规异常的常规异常。虽然异常组在很多方面表现得像常规异常，但它们也支持特殊语法，帮助你有效地处理每个包装异常。如下所示，你可以通过给出一个描述并列出包装的异常来创建一个异常组。

异常 Notes

常规异常具有添加任意 notes 的扩展能力。你可以使用. add_note() 向任何异常添加一个 note，并通过检查.__notes__属性来查看现有 notes。

负零格式化

使用浮点数进行计算时可能会遇到一个奇怪概念——负零。你可以观察到负零和 regular zero 在 REPL 中呈现不同，如下所示。

原文链接：https://mp.weixin.qq.com/s/E-HmgUAQy88ySzo8fJuF5Q

python抓取的一些总结

python作为爬虫的最常见语言，很有自己的优势。这里举一些常见的用法。

1，使用scrapy框架。

https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

2, 纯脚本

lib库

from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup

import Queue, time, random

import requests

提供一个比较粗糙的代码，使用到了python代理，queue,多线程,BeautifulSoup。最终print方法应该用线程锁避免打印错误。


#coding=utf-8
import requests
from concurrent.futures import ThreadPoolExecutor
import Queue, time, random
#import pymysql
from bs4 import BeautifulSoup
import re
import urllib
import urllib2
import gzip
import cStringIO
import datetime
import json
from StringIO import StringIO
import threading

import base64

index_url = 'https://www.juzimi.com/alltags'
page_url = 'https://www.juzimi.com/alltags?page=%s'

task_queue = Queue.Queue()

has_words = []
has_words_value = {}

lock=threading.Lock()

ip = ""
def getIp() :
global ip

url = 'http://s.zdaye.com/?api=201903221353043521&count=1&px=2'
ipret = curl(url, '', False, False)
time.sleep(5)
print "get ip:" + str(ipret)
ip = str(ipret)
def curl(url, data = '', isCompress = False, use_ip = True):

global ip
if(data):
data = urllib.urlencode(data)
else:
data = None
#headers = {"method":"GET","user-agent":self.ua,"Referer":self.refer, "Cookie":self.cookie, "Upgrade-Insecure-Requests": 1,"Accept-Encoding": "gzip, deflate, br"}
headers = {"method":"GET","Accept-Encoding": "gzip, deflate, br", "user-agent" : "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"}

try :

#代理代码
if(use_ip) :
opener = urllib2.build_opener(urllib2.ProxyHandler({"https" : ip}))
urllib2.install_opener(opener)

request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request, timeout=10)

#response = opener.open(request)
if(isCompress) :
buf = StringIO(response.read())
data = gzip.GzipFile(fileobj=buf)
return data.read()
return response.read()

except :
exit()
getIp()
print "get ip retry"
self.curl(url, data, isCompress, use_ip)

#exit()

def setTaskR() :
task_queue.put({'url':'https://www.juzimi.com/tags/%E6%96%B0%E8%AF%97', 'title':'诗词'})

def setTask() :
#for i in range(0, 12) :
for i in range(0, 12) :
url = page_url % (i)
content = curl(url, '', True)
#print url
soup = BeautifulSoup(content, "html.parser")
span = soup.findAll('div',{'class':'views-field-name'})
for tmp in span :
data = {}
href = tmp.find('a').get('href')
title = tmp.find('a').get('title')
data = {"url":"https://www.juzimi.com" + href, "title" : title}
print data
task_queue.put(data)
#print tmp.get('title')
time.sleep(1)

def getFile() :

global has_words
global has_words_value

for line in open('word.log') :
if(line.find('juzimi.com') != -1) :
continue
line = line.split(":", 1)
if(len(line) > 1 and line[1] == 'test') :
continue

#print line[0]
if(not line[0] in has_words) :
has_words.append(line[0])
has_words_value[line[0]] = 1
#print line[0]
else :
has_words_value[line[0]] = has_words_value[line[0]] + 1
has_words = []
for k in has_words_value:
if(has_words_value[k] > 100) :
has_words.append(k)
for line in open('word.url') :
lines = eval(line)
url = lines['url']
title = lines['title'].encode('utf-8')
if(title in has_words) :
continue
task_queue.put(lines)

#runTask()
sleep_time = random.randint(30,60)
#time.sleep(sleep_time)
#time.sleep(60)

def runTask() :
while(not task_queue.empty()) :
data = task_queue.get()
printinfo =[]
hotword = data['title']
url = data['url']
hotword = hotword.encode('utf-8')
#print url
lastIndex = 0
content = curl(url, '', True)
#content = re.sub(r'\n|&nbsp|\xa0|\\xa0|\u3000|\\u3000|\\u0020|\u0020', '', str(content))
content = content.replace('<br/>', '')
content = content.replace('\r', ' ')
soup = BeautifulSoup(content, "html.parser")
last = soup.find('li', {'class', 'pager-last'})
if (not last) :

last = soup.findAll('li', {'class' : 'pager-item'})
if(not last) :
print "get empty:" + url
continue
for tmp in last :
if(int(tmp.text) > lastIndex) :
#print int(tmp.text)
lastIndex = int(tmp.text)
else :
lastIndex = last.text

span = soup.findAll('div',{'class', 'views-field-phpcode-1'})

if(not span) :
print "get empty:" + url
continue

#print url
for tmp in span :
words = tmp.findAll('a')
for word in words :
#printinfo.append({'hotword' : hotword, 'content' : word.text.encode('utf-8')})
print hotword + ":" + word.text.encode('utf-8')
#time.sleep(3)
sleep_time = random.randint(10,20)
#time.sleep(sleep_time)
for i in range(1, int(lastIndex)) :

url = "https://www.juzimi.com/tags/" +hotword + "?page=" + str(i)
#ret = getContent(url, hotword)

t = threading.Thread(target=getContent, args=(url, hotword))
t.start()
"""
for tmp in ret:
printinfo.append(tmp)
"""

"""
for tmp in printinfo :
print tmp['hotword'] + ":" + tmp['content']
"""

def getContent(url, hotword) :
printinfo =[]
#print url
content = curl(url, '', True)
#content = re.sub(r'\n|&nbsp|\xa0|\\xa0|\u3000|\\u3000|\\u0020|\u0020', '', str(content))
content = content.replace('<br/>', '')
content = content.replace('\r', ' ')
soup = BeautifulSoup(content, "html.parser")
last = soup.find('li', {'class', 'pager-last'})
span = soup.findAll('div',{'class', 'views-field-phpcode-1'})
for tmp in span :
words = tmp.findAll('a')
for word in words :
#printinfo.append({'hotword' : hotword, 'content' : word.text.encode('utf-8')})
print hotword + ":" + word.text.encode('utf-8')
sleep_time = random.randint(20,30)
#time.sleep(sleep_time)
#return printinfo

getIp()

getFile()

"""

#setTaskR()
#runTask()
#exit()
"""
executor = ThreadPoolExecutor(max_workers = 50)
#executor.submit(runTask)
#exit()
for i in range(0, 20) :
executor.submit(runTask)

&nbsp;

python 抓取环境库

最近在用python抓取一些文章，python的库比较丰富，在抓取上的确很有优势，下面是引入的一些库

更新：pip install –upgrade pip

pip2 install pymysql
pip install Pillow
pip install beautifulsoup4
pip install lxml
pip install html5lib

包括了Mysql，图片解析和beautifulsoup的相关库，当然还可以用phantomjs做前端抓取，非常有用。