The Python word cloud module, wordcloud, has gone from v1.0 in 2015 to v1.7 at the time of writing.
Download: https://pypi.org/project/wordcloud/
A simple wordcloud example:
    import jieba
    import wordcloud

    w = wordcloud.WordCloud(
        width=600,
        height=600,
        background_color='white',
        font_path='msyh.ttc'  # Microsoft YaHei; a font with Chinese glyphs is needed for this text
    )
    text = '看到此标题,我也是感慨万千 首先弄清楚搞IT和被IT搞,谁是搞IT的?马云就是,马化腾也是,刘强东也是,他们都是叫搞IT的, 但程序员只是被IT搞的人,可以比作盖楼砌砖的泥瓦匠,你想想,四十岁的泥瓦匠能跟二十左右岁的年轻人较劲吗?如果你是老板你会怎么做?程序员只是技术含量高的泥瓦匠,社会是现实的,社会的现实是什么?利益驱动。当你跑的速度不比以前快了时,你就会被挨鞭子赶,这种窘境如果在做程序员当初就预料到的话,你就会知道,到达一定高度时,你需要改变行程。 程序员其实真的不是什么好职业,技术每天都在更新,要不停的学,你以前学的每天都在被淘汰,加班可能是标配了吧。 热点,你知道什么是热点吗?社会上啥热就是热点,我举几个例子:在早淘宝之初,很多人都觉得做淘宝能让自己发展,当初的规则是产品按时间轮候展示,也就是你的商品上架时间一到就会被展示,不论你星级多高。这种一律平等的条件固然好,但淘宝随后调整了显示规则,对产品和店铺,销量进行了加权,一下导致小卖家被弄到了很深的胡同里,没人看到自己的产品,如何卖?做广告费用也非常高,入不敷出,想必做过淘宝的都知道,再后来淘宝弄天猫,显然,天猫是上档次的商城,不同于淘宝的摆地摊,因为摊位费涨价还闹过事,闹也白闹,你有能力就弄,没能力就淘汰掉。前几天淘宝又推出C2M,客户反向定制,客户直接挂钩大厂家,没你小卖家什么事。 后来又出现了微商,在微商出现当天我就知道这东西不行,它比淘宝假货还下三滥.我对TX一直有点偏见,因为骗子都使用QQ 我说这么多只想说一个事,世界是变化的,你只能适应变化,否则就会被淘汰。 还是回到热点这个话题,育儿嫂这个职位有很多人了解吗?前几年放开二胎后,这个职位迅速串红,我的一个亲戚初中毕业,现在已经月入一万五,职务就是照看刚出生的婴儿28天,节假日要双薪。 你说这难到让我一个男的去当育儿嫂吗?扯,我只是说热点问题。你没踩在热点上,你赚钱就会很费劲 这两年的热点是什么?短视频,你可以看到抖音的一些作品根本就不是普通人能实现的,说明专业级人才都开始努力往这上使劲了。 我只会编程,别的不会怎么办?那你就去编程。没人用了怎么办?你看看你自己能不能雇佣你自己 学会适应社会,学会改变自己去适应社会 最后说一句:科大讯飞的刘鹏说的是对的。那我为什么还做程序员?他可以完成一些原始积累,只此而已。'
    # jieba segments the Chinese text; joining with spaces gives wordcloud word boundaries to split on
    new_str = ' '.join(jieba.lcut(text))
    w.generate(new_str)
    w.to_file('x.png')
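Now for the source code.
wordcloud generates a word cloud image in three main steps:
1. Split the text into words
2. Generate the word cloud layout
3. Save the image
Starting from generate(self, text), we find that it does nothing but call another method on the same object, self.generate_from_text(text):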
    def generate_from_text(self, text):
        """Generate wordcloud from text.
        """
        words = self.process_text(text)  # split the text into words
        self.generate_from_frequencies(words)  # main word cloud routine (analyzed in detail below)
        return self
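The source of process_text() follows. Its logic is fairly simple: split the text into words, strip trailing 's, drop numbers, drop short words, drop stopwords, then count what remains.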
    def process_text(self, text):
        """Splits a long text into words, eliminates the stopwords.

        Parameters
        ----------
        text : string
            The text to be processed.

        Returns
        -------
        words : dict (string, int)
            Word tokens with associated frequency.

        ..versionchanged:: 1.2.2
            Changed return type from list of tuples to dict.

        Notes
        -----
        There are better ways to do word tokenization, but I don't want to
        include all those things.
        """

        flags = (re.UNICODE if sys.version < '3' and type(text) is unicode
                 else 0)

        regexp = self.regexp if self.regexp is not None else r"\w[\w']+"

        # Tokenize
        words = re.findall(regexp, text, flags)
        # Strip trailing 's
        words = [word[:-2] if word.lower().endswith("'s") else word
                 for word in words]
        # Remove numbers
        if not self.include_numbers:
            words = [word for word in words if not word.isdigit()]
        # Remove short words: anything shorter than min_word_length is filtered out
        if self.min_word_length:
            words = [word for word in words if len(word) >= self.min_word_length]
        # Remove stopwords
        stopwords = set([i.lower() for i in self.stopwords])
        if self.collocations:
            word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
        else:
            # remove stopwords
            words = [word for word in words if word.lower() not in stopwords]
            word_counts, _ = process_tokens(words, self.normalize_plurals)

        return word_counts
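To get a feel for what each filter does, here is a rough stand-alone sketch of the non-collocation branch. It is my own simplification, not the library's code: simple_process_text is a hypothetical name, it counts with collections.Counter, and it skips the plural normalization that process_tokens performs.

```python
import re
from collections import Counter


def simple_process_text(text, stopwords=(), min_word_length=0, include_numbers=False):
    """Hypothetical stand-in for process_text(): tokenize, filter, count."""
    # Tokenize: one word character followed by one or more word characters or apostrophes,
    # so single-character tokens never match.
    words = re.findall(r"\w[\w']+", text)
    # Strip a trailing possessive 's.
    words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    # Drop tokens made only of digits unless numbers are wanted.
    if not include_numbers:
        words = [w for w in words if not w.isdigit()]
    # Drop words shorter than min_word_length (0 disables the filter).
    if min_word_length:
        words = [w for w in words if len(w) >= min_word_length]
    # Drop stopwords, compared case-insensitively.
    stop = {s.lower() for s in stopwords}
    words = [w for w in words if w.lower() not in stop]
    # Assumption: plain counting, with no merging of singular and plural forms.
    return dict(Counter(words))


counts = simple_process_text("the cat's hat and the 2 cats in 2020",
                             stopwords={"the", "and", "in"})
print(counts)  # {'cat': 1, 'hat': 1, 'cats': 1}
```

Note that the lone "2" never becomes a token because the default pattern needs at least two characters, while "2020" is tokenized and then removed by the digit filter.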
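Now for the main event.
The body of generate_from_frequencies(self, frequencies, max_font_size=None) is fairly long. Overall it breaks down into these steps:
1. Sort
2. Normalize the word frequencies
3. Create the drawing objects
4. Determine the initial font size
5. Extend the word set
6. Determine each word's font size, position, rotation, and color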
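Steps 1, 2, and 5 only manipulate the list of (word, frequency) pairs, so they can be tried in isolation. The sketch below is a minimal stand-alone version under my own assumptions: normalize_frequencies and pad_frequencies are hypothetical names that do not exist in wordcloud, and math.ceil stands in for NumPy's ceil.

```python
import math
from operator import itemgetter


def normalize_frequencies(frequencies, max_words):
    """Steps 1 and 2: sort descending, keep the top max_words, scale the top word to 1.0."""
    if not frequencies:
        raise ValueError("need at least 1 word")
    ranked = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    ranked = ranked[:max_words]
    top = float(ranked[0][1])
    return [(word, count / top) for word, count in ranked]


def pad_frequencies(normalized, max_words):
    """Step 5: repeat the word list, down-weighting each extra copy."""
    padded = list(normalized)
    if not normalized or len(normalized) >= max_words:
        return padded
    times_extend = math.ceil(max_words / len(normalized)) - 1
    downweight = normalized[-1][1]  # smallest normalized frequency
    for i in range(times_extend):
        padded.extend((word, freq * downweight ** (i + 1)) for word, freq in normalized)
    return padded


ranked = normalize_frequencies({"python": 8, "cloud": 4, "word": 2, "font": 1}, max_words=3)
print(ranked)  # [('python', 1.0), ('cloud', 0.5), ('word', 0.25)]

padded = pad_frequencies(ranked[:2], max_words=5)
print(padded)
# [('python', 1.0), ('cloud', 0.5), ('python', 0.5), ('cloud', 0.25), ('python', 0.25), ('cloud', 0.125)]
```

The padded list ends up with 6 entries although max_words is 5, because whole copies of the list are appended and nothing trims the result afterwards.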
The source of generate_from_frequencies follows (I have added comments based on my own understanding):
    def generate_from_frequencies(self, frequencies, max_font_size=None):
        """Create a word_cloud from words and frequencies.

        Parameters
        ----------
        frequencies : dict from string to float
            A contains words and associated frequency.

        max_font_size : int
            Use this font-size instead of self.max_font_size

        Returns
        -------
        self

        """
        # make sure frequencies are sorted and normalized
        # 1. Sort
        # Sort the (word, frequency) pairs by frequency, descending
        frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
        if len(frequencies) <= 0:
            raise ValueError("We need at least 1 word to plot a word cloud, "
                             "got %d." % len(frequencies))
        # Keep at most max_words words; the rest are discarded
        frequencies = frequencies[:self.max_words]

        # largest entry will be 1
        # The first word's frequency is the maximum frequency
        max_frequency = float(frequencies[0][1])

        # 2. Normalize the word frequencies
        # The words are already sorted, so after normalization the list looks like
        # [('xxx', 1), ('xxx', 0.96), ('xxx', 0.87), ...]
        frequencies = [(word, freq / max_frequency)
                       for word, freq in frequencies]

        # Random number generator: decides whether a word is rotated by 90 degrees
        # and is also passed to sample_position() further down
        if self.random_state is not None:
            random_state = self.random_state
        else:
            random_state = Random()

        if self.mask is not None:
            boolean_mask = self._get_bolean_mask(self.mask)
            width = self.mask.shape[1]
            height = self.mask.shape[0]
        else:
            boolean_mask = None
            height, width = self.height, self.width
        # Used to find where a word can be placed, i.e. blank (text-free) areas
        # inside the usable part of the image
        occupancy = IntegralOccupancyMap(height, width, boolean_mask)

        # 3. Create the drawing objects
        # create image
        img_grey = Image.new("L", (width, height))
        draw = ImageDraw.Draw(img_grey)
        img_array = np.asarray(img_grey)
        font_sizes, positions, orientations, colors = [], [], [], []

        last_freq = 1.

        # 4. Determine the initial font size
        # Determine the maximum font size
        if max_font_size is None:
            # if not provided use default font_size
            max_font_size = self.max_font_size

        # If the maximum font size is still unset, work one out to use as the initial font size
        if max_font_size is None:
            # figure out a good font size by trying to draw with
            # just the first two words
            if len(frequencies) == 1:
                # we only have one word. We make it big!
                font_size = self.height
            else:
                # Recurse into this function to get a self.layout_ that holds only the
                # first two words, then compute the initial font size from the font
                # sizes those two words were given (their harmonic mean)
                self.generate_from_frequencies(dict(frequencies[:2]),
                                               max_font_size=self.height)
                # find font sizes
                sizes = [x[1] for x in self.layout_]
                try:
                    font_size = int(2 * sizes[0] * sizes[1]
                                    / (sizes[0] + sizes[1]))
                # quick fix for if self.layout_ contains less than 2 values
                # on very small images it can be empty
                except IndexError:
                    try:
                        font_size = sizes[0]
                    except IndexError:
                        raise ValueError(
                            "Couldn't find space to draw. Either the Canvas size"
                            " is too small or too much of the image is masked "
                            "out.")
        else:
            font_size = max_font_size

        # we set self.words_ here because we called generate_from_frequencies
        # above... hurray for good design?
        self.words_ = dict(frequencies)

        # 5. Extend the word set
        # If repeat is enabled and there are fewer words than max_words,
        # pad the word set by repeating words
        if self.repeat and len(frequencies) < self.max_words:
            # pad frequencies with repeating words.
            times_extend = int(np.ceil(self.max_words / len(frequencies))) - 1
            # get smallest frequency
            frequencies_org = list(frequencies)
            downweight = frequencies[-1][1]
            # Each repeated copy is scaled down again, so the frequencies keep decreasing
            for i in range(times_extend):
                frequencies.extend([(word, freq * downweight ** (i + 1))
                                    for word, freq in frequencies_org])

        # 6. Determine each word's font size, position, rotation, and color
        # start drawing grey image
        for word, freq in frequencies:
            if freq == 0:
                continue
            # select the font size
            rs = self.relative_scaling
            if rs != 0:
                font_size = int(round((rs * (freq / float(last_freq))
                                       + (1 - rs)) * font_size))
            if random_state.random() < self.prefer_horizontal:
                orientation = None
            else:
                orientation = Image.ROTATE_90
            tried_other_orientation = False
            # Search for a place to put the word. If an attempt fails, try changing the
            # orientation or shrinking the font, then search again, until a place is
            # found or the font size drops below the minimum
            while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.TransposedFont(
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)
                if result is not None or font_size < self.min_font_size:
                    # either we found a place or font-size went too small
                    break
                # if we didn't find a place, make font smaller
                # but first try to rotate!
                if not tried_other_orientation and self.prefer_horizontal < 1:
                    orientation = (Image.ROTATE_90 if orientation is None else
                                   Image.ROTATE_90)
                    tried_other_orientation = True
                else:
                    font_size -= self.font_step
                    orientation = None

            if font_size < self.min_font_size:
                # we were unable to draw any more
                break

            # Record this word's font size, position, rotation, and color
            x, y = np.array(result) + self.margin // 2
            # actually draw the text
            # This drawing only tracks where words have been placed. It is not the final
            # word cloud image, which is produced by another method, to_image
            draw.text((y, x), word, fill="white", font=transposed_font)
            positions.append((x, y))
            orientations.append(orientation)
            font_sizes.append(font_size)
            colors.append(self.color_func(word, font_size=font_size,
                                          position=(x, y),
                                          orientation=orientation,
                                          random_state=random_state,
                                          font_path=self.font_path))
            # recompute integral image
            if self.mask is None:
                img_array = np.asarray(img_grey)
            else:
                img_array = np.asarray(img_grey) + boolean_mask
            # recompute bottom right
            # the order of the cumsum's is important for speed ?!
            occupancy.update(img_array, x, y)
            last_freq = freq

        # layout_ is the list of per-word records: (word, frequency), font size,
        # position, rotation, and color, ready for the drawing step that follows
        self.layout_ = list(zip(frequencies, font_sizes, positions,
                                orientations, colors))
        return self
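The font-size arithmetic in steps 4 and 6 is easy to lose among the drawing calls, so here are the two formulas from the listing pulled out on their own. This is only a sketch: initial_font_size and next_font_size are names I made up, and the example ignores the extra shrinking that happens when a word does not fit.

```python
def initial_font_size(size_first, size_second):
    """Step 4: harmonic mean of the sizes found for the two most frequent words."""
    return int(2 * size_first * size_second / (size_first + size_second))


def next_font_size(font_size, freq, last_freq, relative_scaling):
    """Step 6: scale the running font size by the drop in frequency."""
    rs = relative_scaling
    if rs == 0:
        # With relative_scaling == 0 the frequency ratio is ignored entirely.
        return font_size
    return int(round((rs * (freq / float(last_freq)) + (1 - rs)) * font_size))


start = initial_font_size(200, 100)
print(start)  # 133

sizes = []
size, last = start, 1.0
for freq in [1.0, 0.5, 0.25]:
    size = next_font_size(size, freq, last, relative_scaling=0.5)
    sizes.append(size)
    last = freq
print(sizes)  # [133, 100, 75]
```

With relative_scaling=0.5, halving the frequency shrinks the font to 75% of the previous size rather than 50%, which is why less frequent words stay readable.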
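Note
In step 6, when determining a position, the program uses a loop and random numbers to search for a suitable place to put each word. The relevant source is: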
    # Search for a place to put the word. If an attempt fails, try changing the
    # orientation or shrinking the font, then search again, until a place is
    # found or the font size drops below the minimum
    while True:
        # try to find a position
        font = ImageFont.truetype(self.font_path, font_size)
        # transpose font optionally
        transposed_font = ImageFont.TransposedFont(
            font, orientation=orientation)
        # get size of resulting text
        box_size = draw.textsize(word, font=transposed_font)
        # find possible places using integral image:
        result = occupancy.sample_position(box_size[1] + self.margin,
                                           box_size[0] + self.margin,
                                           random_state)
        if result is not None or font_size < self.min_font_size:
            # either we found a place or font-size went too small
            break
        # if we didn't find a place, make font smaller
        # but first try to rotate!
        if not tried_other_orientation and self.prefer_horizontal < 1:
            orientation = (Image.ROTATE_90 if orientation is None else
                           Image.ROTATE_90)
            tried_other_orientation = True
        else:
            font_size -= self.font_step
            orientation = None
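Stripped of the Pillow calls, the loop is a small retry strategy: try to place the word, rotate once if that fails, then keep shrinking. The sketch below shows that control flow only. fit_word and the find_position callback are hypothetical, and unlike the source it always starts horizontal instead of picking the first orientation at random.

```python
def fit_word(find_position, font_size, min_font_size, font_step, allow_rotation=True):
    """Return (position, font_size, rotated), or None if the word cannot be placed.

    find_position(font_size, rotated) is a hypothetical callback that returns
    a position, or None when nothing fits.
    """
    rotated = False
    tried_other_orientation = False
    while True:
        position = find_position(font_size, rotated)
        if position is not None:
            return position, font_size, rotated
        if font_size < min_font_size:
            # Font went below the minimum: give up on this word.
            return None
        if allow_rotation and not tried_other_orientation:
            # First failure: try the rotated orientation once at the same size.
            rotated = True
            tried_other_orientation = True
        else:
            # Later failures: shrink the font and go back to horizontal.
            font_size -= font_step
            rotated = False


def fake_find_position(font_size, rotated):
    # Pretend the canvas only has room for horizontal text of size 20 or smaller.
    return (0, 0) if font_size <= 20 and not rotated else None


print(fit_word(fake_find_position, font_size=24, min_font_size=4, font_step=2))
# ((0, 0), 20, False)
```

Rotation is attempted only once per word, so after that first retry every further failure costs one font_step.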
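Here occupancy.sample_position() is the method that actually searches for a suitable position. But when you try to dig further into how it works, you find that Ctrl+click can no longer jump into the underlying code. The sad thing has happened after all... o(╥﹏╥)o
Near the top of wordcloud.py there is this line: from .query_integral_image import query_integral_image. Here query_integral_image is a .pyd file, a compiled extension module, so it cannot be read directly as source. For more on the .pyd format, please look it up yourself.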
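Even without opening the compiled module, the name hints at the technique: an integral image (summed-area table), which answers "is this rectangle completely empty?" with four lookups. The sketch below is my own pure-Python illustration of that general technique, not the contents of query_integral_image. The helpers integral_image, free_positions, and sample_position are hypothetical and work on nested lists instead of NumPy arrays.

```python
import random


def integral_image(grid):
    """Summed-area table with an extra zero row and column.

    table[i][j] holds the sum of grid[0:i][0:j], so any rectangle sum
    needs only four lookups.
    """
    rows, cols = len(grid), len(grid[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            table[i + 1][j + 1] = (grid[i][j] + table[i][j + 1]
                                   + table[i + 1][j] - table[i][j])
    return table


def free_positions(table, size_x, size_y):
    """All top-left corners (row, col) where a size_x by size_y box covers only zeros."""
    rows, cols = len(table) - 1, len(table[0]) - 1
    hits = []
    for i in range(rows - size_x + 1):
        for j in range(cols - size_y + 1):
            area = (table[i + size_x][j + size_y] - table[i][j + size_y]
                    - table[i + size_x][j] + table[i][j])
            # Values are never negative, so a zero sum means every cell is free.
            if area == 0:
                hits.append((i, j))
    return hits


def sample_position(table, size_x, size_y, rng):
    """Pick one free position at random, or None when nothing fits (assumed behavior)."""
    hits = free_positions(table, size_x, size_y)
    return rng.choice(hits) if hits else None


# 1 marks an occupied pixel, 0 a free one.
occupied = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
table = integral_image(occupied)
print(free_positions(table, 2, 2))                      # [(0, 2)]
print(sample_position(table, 3, 3, random.Random(0)))   # None
```

Because occupancy is tracked per pixel, any gap large enough for the box counts as free, including gaps between the strokes of a word that was drawn earlier.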
Back to generate_from_frequencies: at the very end, the method gathers the data into self.layout_, which holds everything needed to draw each word. After that, to_file() can be called to save the image.
    def to_file(self, filename):

        img = self.to_image()
        img.save(filename, optimize=True)
        return self
The core method to_image() takes the entries in self.layout_ one by one and draws each word.
    def to_image(self):
        self._check_generated()
        if self.mask is not None:
            width = self.mask.shape[1]
            height = self.mask.shape[0]
        else:
            height, width = self.height, self.width

        img = Image.new(self.mode, (int(width * self.scale),
                                    int(height * self.scale)),
                        self.background_color)
        draw = ImageDraw.Draw(img)
        for (word, count), font_size, position, orientation, color in self.layout_:
            font = ImageFont.truetype(self.font_path,
                                      int(font_size * self.scale))
            transposed_font = ImageFont.TransposedFont(
                font, orientation=orientation)
            pos = (int(position[1] * self.scale),
                   int(position[0] * self.scale))
            draw.text(pos, word, fill=color, font=transposed_font)

        return self._draw_contour(img=img)
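Two details in to_image() are easy to miss: every size and coordinate is multiplied by self.scale, and the two components of position are swapped before drawing, which suggests that layout_ stores positions as (row, column) while the drawing call wants (x, y). The sketch below replays just that bookkeeping on a hand-made layout. scaled_draw_calls and the sample data are hypothetical, and no image is drawn.

```python
def scaled_draw_calls(layout, scale):
    """Turn layout entries into (xy, word, pixel_font_size) tuples ready for drawing."""
    calls = []
    for (word, _freq), font_size, position, _orientation, _color in layout:
        # Swap the stored pair into (x, y) order and apply the scale factor.
        xy = (int(position[1] * scale), int(position[0] * scale))
        calls.append((xy, word, int(font_size * scale)))
    return calls


# Hand-made entries in the same shape as self.layout_:
# ((word, frequency), font_size, position, orientation, color)
layout = [
    (("python", 1.0), 40, (10, 25), None, "rgb(0, 0, 0)"),
    (("cloud", 0.5), 20, (60, 5), None, "rgb(90, 90, 90)"),
]
calls = scaled_draw_calls(layout, scale=2)
print(calls)
# [((50, 20), 'python', 80), ((10, 120), 'cloud', 40)]
```

Since the layout is computed once at the base size and only scaled here, changing scale produces a larger image without rerunning the placement search.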
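Further thought:
How would you implement the search for a suitable place to put a word? (Note: the gaps between the strokes of a character can also hold text in a smaller font size.)
~ End ~
Reposted from: https://blog.csdn.net/bailichun19901111/article/details/106118092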