1. A brief introduction to LDA
- Input: a list of documents.
- Output:
  1) the topic distribution Θ of each document, where the number of topics K is a hyperparameter (in Figure 1 of the original, K = 3);
  2) the per-topic word-probability matrix Φ (Table 1 of the original), with V rows (the vocabulary size) and K columns (the number of topics).
- Use of Θ: it can serve as a document's feature vector and be fed into a classifier, e.g. for sentiment analysis.
- Use of Φ: for each topic, extract the words with the highest probabilities and use them to judge what that topic is about; for example, topic 1 might be about fruit (apple, etc.) and topic 2 about games. A toy sketch of Θ and Φ follows this list.
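To make the two outputs concrete, here is a minimal, purely illustrative sketch (all numbers and words are made up and are not from the original post) of what Θ and Φ could look like for 2 documents, K = 3 topics, and a vocabulary of V = 4 words:

import numpy as np

# Hypothetical toy values: 2 documents, K = 3 topics, vocabulary size V = 4.
# theta: one topic distribution per document (each row sums to 1).
theta = np.array([
    [0.7, 0.2, 0.1],   # document 0 is mostly about topic 0
    [0.1, 0.1, 0.8],   # document 1 is mostly about topic 2
])
# phi: word probabilities per topic, stored as V rows x K columns (each column sums to 1),
# matching the convention described above.
phi = np.array([
    [0.50, 0.10, 0.05],   # word "apple"
    [0.30, 0.05, 0.05],   # word "banana"
    [0.10, 0.60, 0.10],   # word "game"
    [0.10, 0.25, 0.80],   # word "service"
])
print(theta.shape, phi.shape)  # (2, 3) (4, 3)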
2. Recap of model parameter estimation methods
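The formulas that accompanied this section in the original post are images and are not reproduced here. As a hedged recap (my wording, not the author's) of the standard options that Section 3 builds on:

% Maximum likelihood: pick the single theta that best explains the data
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \, p(D \mid \theta)
% Maximum a posteriori: same, but weighted by a prior p(theta)
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, p(D \mid \theta)\, p(\theta)
% Fully Bayesian: keep the entire posterior p(\theta \mid D) and integrate over it, as in Section 3.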
3. Approximate computation
(1) Because θ takes on infinitely many values, the integral cannot be evaluated in closed form.
(2) Approximation: the θ space contains infinitely many θ, but we do not need all of them. Instead we draw a few samples of θ using Markov chain Monte Carlo (MCMC) sampling, which lets us approximate the integral above (see the reconstruction below) as a finite average.
Here S is the number of sampled θ, and each θ_s is drawn from p(θ|D).
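The formula itself appears as an image in the original post. Reconstructed from the surrounding text (with x standing for a new observation, which is an assumption about the exact quantity being approximated), it is presumably the posterior-predictive integral and its Monte Carlo estimate:

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(x \mid \theta_s),
\qquad \theta_s \sim p(\theta \mid D)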
4. Implementing LDA with sklearn
The code is from a Greedy AI (贪心科技) instructor. Dataset: https://pan.baidu.com/s/1aUXzmxyjMK87M4JqJfVMKA (extraction code: w2j7)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
yelp = pd.read_csv('./datasets/yelp.csv', encoding='utf-8')
#print (yelp['text'].head())
yelp['text'].head() # Series
0 My wife took me here on my birthday for breakf...
1 I have no idea why some people give bad review...
2 love the gyro plate. Rice is so good and I als...
3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4 General Manager Scott Petello is a good egg!!!...
Name: text, dtype: object
# Represent each document as a count vector, i.e. a vector of word frequencies over the vocabulary
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(yelp['text'])
X.shape # 10000 documents, vocabulary size 28880
(10000, 28880)
vectorizer.get_feature_names()[:15], # len: 28880, i.e. every word in the vocabulary
(['00','000','007','00a','00am','00pm','01','02','03','03342','04','05','06','07','08'],)
For more on how sklearn's CountVectorizer() is used, see the reference linked in the original post; a small toy illustration follows below.
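As a quick, self-contained illustration of what CountVectorizer produces (the three-sentence corpus here is invented for this example and is not part of the Yelp data):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the food was good", "good food good service", "the service was slow"]  # toy corpus
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)    # sparse document-term count matrix
print(vec.get_feature_names())      # learned vocabulary, e.g. ['food', 'good', 'service', 'slow']
                                    # (newer scikit-learn versions use get_feature_names_out() instead)
print(counts.toarray())             # per-document word counts, one row per document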
# Instantiate the LDA model with the number of topics set to 2
lda = LatentDirichletAllocation(n_components=2, learning_method="batch", random_state=42)
model = lda.fit(X)
model.components_, model.components_.shape # essentially the topic-word probability matrix Φ mentioned above, just unnormalized and stored as K = 2 rows by V = 28880 columns
(array([[103.39623788,   0.65479342,   0.50044583, ...,   1.49973798,
           0.95780575,   1.49980579],
        [121.60376212,  24.34520658,   1.49955417, ...,   0.50026202,
           1.04219425,   0.50019421]]),
 (2, 28880))
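Because components_ holds pseudo-counts rather than probabilities, one way (a small sketch, not in the original post) to recover the normalized topic-word matrix Φ is to divide each row by its sum:

# Normalize each topic's pseudo-counts into a probability distribution over words.
phi = model.components_ / model.components_.sum(axis=1, keepdims=True)  # shape (K, V) = (2, 28880)
print(phi.sum(axis=1))  # each topic's word probabilities now sum to 1: [1. 1.]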
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

####
# Per topic: (token, pseudocount)
# pseudocount represents the number of times word j was assigned to topic i
#
# We can convert these to a normalized form -- you don't have to,
# but it's easier to understand the output that way, and it is
# consistent with how Gensim presents topics. We will plot these out.
####
def display_topics(model, feature_names, n_words=10, plot=False):
    topics_tokens = []
    for topic_idx, topic in enumerate(model.components_):  # topic: one row of the unnormalized Φ (an ndarray of pseudo-counts)
        topic = zip(feature_names, topic)                # iterator of (word, pseudo-count) pairs for this topic
        topic = sorted(topic, key=lambda pair: pair[1])  # sort by pseudo-count, ascending
        topic_words = [(token, counts)
                       for token, counts in topic[:-n_words - 1:-1]]  # walk the list backwards to take the top n_words words
        topics_tokens.append(topic_words)                # keep the top-n_words (word, count) pairs for each topic
        if not plot:
            print("Topic %d:" % (topic_idx))
            print(topic_words)
    if plot:
        fig, ax = plt.subplots(figsize=(10, 10), nrows=2, ncols=2)  # fig: the canvas, ax: a 2x2 grid of subplots
        topics = [{key: value for key, value in topic} for topic in topics_tokens]  # one {word: count} dict per topic
        row = 0
        for topic_id, topic in enumerate(topics):
            column = (0 if topic_id % 2 == 0 else 1)  # even topic_id goes in column 0, odd in column 1
            chart = pd.DataFrame([topic]).iloc[0].sort_values(axis=0)  # row 0 of the one-row DataFrame, counts sorted ascending
            print("*****chart******")
            print(chart)
            chart.plot(kind="barh", title="Topic %d" % topic_id, ax=ax[row, column])
            row += 1 if column == 1 else 0
        plt.tight_layout()

display_topics(model, vectorizer.get_feature_names(), n_words=5, plot=True)
*****chart******
like 2975.115078
great 3409.259965
place 4198.939003
good 5252.347333
food 5284.713890
Name: 0, dtype: float64
*****chart******
great 1718.740035
time 1781.906948
just 1991.218851
like 2066.884922
place 2464.060997
Name: 0, dtype: float64
display_topics(model, vectorizer.get_feature_names(), n_words=5)
Topic 0:
[('food', 5284.7138897847635), ('good', 5252.347333068924), ('place', 4198.939003047488), ('great', 3409.2599648858773), ('like', 2975.1150778672973)]
Topic 1:
[('place', 2464.06099695238), ('like', 2066.884922132562), ('just', 1991.218850786098), ('time', 1781.9069476912282), ('great', 1718.7400351139627)]
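As a simpler alternative sketch (not part of the original post) that lists the top words per topic directly with numpy, without plotting:

feature_names = np.array(vectorizer.get_feature_names())
for topic_idx, word_counts in enumerate(model.components_):
    top = np.argsort(word_counts)[::-1][:5]  # indices of the 5 largest pseudo-counts
    print("Topic %d:" % topic_idx, feature_names[top].tolist())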
comp = model.transform(X)
print(comp.shape)  # the document-topic distribution Θ, one row per document
document_topics = pd.DataFrame(comp, columns=["topic %d" % i for i in range(comp.shape[1])])
document_topics.shape
(10000, 2)
document_topics
       topic 0   topic 1
0     0.987979  0.012021
1     0.790358  0.209642
2     0.914572  0.085428
3     0.012651  0.987349
4     0.086899  0.913101
...        ...       ...
9995  0.988700  0.011300
9996  0.988655  0.011345
9997  0.738667  0.261333
9998  0.009002  0.990998
9999  0.756130  0.243870
10000 rows × 2 columns
top_topics = document_topics['topic 0'] > .8
document_topics[top_topics].head()  # the first rows, i.e. documents, whose topic 0 probability is greater than 0.8
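To see the actual review texts behind those rows, a small follow-up sketch (not in the original post; it assumes the rows of document_topics are in the same order as the rows of yelp, which is how they were built above):

# Pull the original review texts for documents whose topic 0 probability exceeds 0.8.
strong_topic0_reviews = yelp.loc[top_topics, 'text']
print(strong_topic0_reviews.shape)
print(strong_topic0_reviews.head())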