1. A brief introduction to LDA
- Input: a list of documents.
- Output:
  1) the topic distribution Θ of each document, where the number of topics K is a hyperparameter (in Figure 1 of the original, K = 3);
  2) the per-topic word-probability matrix Φ (Table 1 of the original), with V rows (the vocabulary size) and K columns (the number of topics).
- Use of Θ: it can serve as a document's feature vector and be fed into a classifier, e.g. for sentiment analysis.
- Use of Φ: for each topic, extract the words with the highest probabilities and use them to judge what that topic is about; for example, topic 1 might be about fruit (apple, etc.) and topic 2 about games. A toy sketch of Θ and Φ follows this list.
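To make the two outputs concrete, here is a minimal, purely illustrative sketch (all numbers and words are made up and are not from the original post) of what Θ and Φ could look like for 2 documents, K = 3 topics, and a vocabulary of V = 4 words:

import numpy as np

# Hypothetical toy values: 2 documents, K = 3 topics, vocabulary size V = 4.
# theta: one topic distribution per document (each row sums to 1).
theta = np.array([
    [0.7, 0.2, 0.1],   # document 0 is mostly about topic 0
    [0.1, 0.1, 0.8],   # document 1 is mostly about topic 2
])
# phi: word probabilities per topic, stored as V rows x K columns (each column sums to 1),
# matching the convention described above.
phi = np.array([
    [0.50, 0.10, 0.05],   # word "apple"
    [0.30, 0.05, 0.05],   # word "banana"
    [0.10, 0.60, 0.10],   # word "game"
    [0.10, 0.25, 0.80],   # word "service"
])
print(theta.shape, phi.shape)  # (2, 3) (4, 3)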
2. Recap of model parameter estimation methods
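The formulas that accompanied this section in the original post are images and are not reproduced here. As a hedged recap (my wording, not the author's) of the standard options that Section 3 builds on:

% Maximum likelihood: pick the single theta that best explains the data
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \, p(D \mid \theta)
% Maximum a posteriori: same, but weighted by a prior p(theta)
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, p(D \mid \theta)\, p(\theta)
% Fully Bayesian: keep the entire posterior p(\theta \mid D) and integrate over it, as in Section 3.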
3. Approximate computation
(1) Because θ takes on infinitely many values, the integral cannot be evaluated in closed form.
(2) Approximation: the θ space contains infinitely many θ, but we do not need all of them. Instead we draw a few samples of θ using Markov chain Monte Carlo (MCMC) sampling, which lets us approximate the integral above (see the reconstruction below) as a finite average.
Here S is the number of sampled θ, and each θ_s is drawn from p(θ|D).
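The formula itself appears as an image in the original post. Reconstructed from the surrounding text (with x standing for a new observation, which is an assumption about the exact quantity being approximated), it is presumably the posterior-predictive integral and its Monte Carlo estimate:

p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(x \mid \theta_s),
\qquad \theta_s \sim p(\theta \mid D)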
4. Implementing LDA with sklearn
The code is from a Greedy AI (贪心科技) instructor. Dataset: https://pan.baidu.com/s/1aUXzmxyjMK87M4JqJfVMKA (extraction code: w2j7)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
yelp = pd.read_csv('./datasets/yelp.csv', encoding='utf-8')
#print (yelp['text'].head())
yelp['text'].head() # Series
0 My wife took me here on my birthday for breakf...
1 I have no idea why some people give bad review...
2 love the gyro plate. Rice is so good and I als...
3 Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4 General Manager Scott Petello is a good egg!!!...
Name: text, dtype: object
# Represent each document as a count vector, i.e. a vector of word frequencies over the vocabulary
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(yelp['text'])
X.shape # 10000 documents, vocabulary size 28880
(10000, 28880)
vectorizer.get_feature_names()[:15], # len: 28880, i.e. every word in the vocabulary
(['00','000','007','00a','00am','00pm','01','02','03','03342','04','05','06','07','08'],)
For more on how sklearn's CountVectorizer() is used, see the reference linked in the original post; a small toy illustration follows below.
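As a quick, self-contained illustration of what CountVectorizer produces (the three-sentence corpus here is invented for this example and is not part of the Yelp data):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the food was good", "good food good service", "the service was slow"]  # toy corpus
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)    # sparse document-term count matrix
print(vec.get_feature_names())      # learned vocabulary, e.g. ['food', 'good', 'service', 'slow']
                                    # (newer scikit-learn versions use get_feature_names_out() instead)
print(counts.toarray())             # per-document word counts, one row per document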
# Instantiate the LDA model with the number of topics set to 2
lda = LatentDirichletAllocation(n_components=2, learning_method="batch", random_state=42)
model = lda.fit(X)
model.components_, model.components_.shape # essentially the topic-word probability matrix Φ mentioned above, just unnormalized and stored as K = 2 rows by V = 28880 columns
(array([[103.39623788,   0.65479342,   0.50044583, ...,   1.49973798,
           0.95780575,   1.49980579],
        [121.60376212,  24.34520658,   1.49955417, ...,   0.50026202,
           1.04219425,   0.50019421]]),
 (2, 28880))
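Because components_ holds pseudo-counts rather than probabilities, one way (a small sketch, not in the original post) to recover the normalized topic-word matrix Φ is to divide each row by its sum:

# Normalize each topic's pseudo-counts into a probability distribution over words.
phi = model.components_ / model.components_.sum(axis=1, keepdims=True)  # shape (K, V) = (2, 28880)
print(phi.sum(axis=1))  # each topic's word probabilities now sum to 1: [1. 1.]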
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

####
# Per topic: (token, pseudocount)
# pseudocount represents the number of times word j was assigned to topic i
#
# We can convert these to a normalized form -- you don't have to,
# but it's easier to understand the output that way, and it is
# consistent with how Gensim presents topics. We will plot these out.
####
def display_topics(model, feature_names, n_words=10, plot=False):
    topics_tokens = []
    for topic_idx, topic in enumerate(model.components_):  # topic: one row of the unnormalized Φ (an ndarray of pseudo-counts)
        topic = zip(feature_names, topic)                # iterator of (word, pseudo-count) pairs for this topic
        topic = sorted(topic, key=lambda pair: pair[1])  # sort by pseudo-count, ascending
        topic_words = [(token, counts)
                       for token, counts in topic[:-n_words - 1:-1]]  # walk the list backwards to take the top n_words words
        topics_tokens.append(topic_words)                # keep the top-n_words (word, count) pairs for each topic
        if not plot:
            print("Topic %d:" % (topic_idx))
            print(topic_words)
    if plot:
        fig, ax = plt.subplots(figsize=(10, 10), nrows=2, ncols=2)  # fig: the canvas, ax: a 2x2 grid of subplots
        topics = [{key: value for key, value in topic} for topic in topics_tokens]  # one {word: count} dict per topic
        row = 0
        for topic_id, topic in enumerate(topics):
            column = (0 if topic_id % 2 == 0 else 1)  # even topic_id goes in column 0, odd in column 1
            chart = pd.DataFrame([topic]).iloc[0].sort_values(axis=0)  # row 0 of the one-row DataFrame, counts sorted ascending
            print("*****chart******")
            print(chart)
            chart.plot(kind="barh", title="Topic %d" % topic_id, ax=ax[row, column])
            row += 1 if column == 1 else 0
        plt.tight_layout()

display_topics(model, vectorizer.get_feature_names(), n_words=5, plot=True)
*****chart******
like 2975.115078
great 3409.259965
place 4198.939003
good 5252.347333
food 5284.713890
Name: 0, dtype: float64
*****chart******
great 1718.740035
time 1781.906948
just 1991.218851
like 2066.884922
place 2464.060997
Name: 0, dtype: float64
display_topics(model, vectorizer.get_feature_names(), n_words=5)
Topic 0:
[('food', 5284.7138897847635), ('good', 5252.347333068924), ('place', 4198.939003047488), ('great', 3409.2599648858773), ('like', 2975.1150778672973)]
Topic 1:
[('place', 2464.06099695238), ('like', 2066.884922132562), ('just', 1991.218850786098), ('time', 1781.9069476912282), ('great', 1718.7400351139627)]
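As a simpler alternative sketch (not part of the original post) that lists the top words per topic directly with numpy, without plotting:

feature_names = np.array(vectorizer.get_feature_names())
for topic_idx, word_counts in enumerate(model.components_):
    top = np.argsort(word_counts)[::-1][:5]  # indices of the 5 largest pseudo-counts
    print("Topic %d:" % topic_idx, feature_names[top].tolist())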
comp = model.transform(X)
print(comp.shape)  # the document-topic distribution Θ, one row per document
document_topics = pd.DataFrame(comp, columns=["topic %d" % i for i in range(comp.shape[1])])
document_topics.shape
(10000, 2)
document_topics
       topic 0   topic 1
0     0.987979  0.012021
1     0.790358  0.209642
2     0.914572  0.085428
3     0.012651  0.987349
4     0.086899  0.913101
...        ...       ...
9995  0.988700  0.011300
9996  0.988655  0.011345
9997  0.738667  0.261333
9998  0.009002  0.990998
9999  0.756130  0.243870
10000 rows × 2 columns
top_topics = document_topics['topic 0'] > .8
document_topics[top_topics].head()  # the first rows, i.e. documents, whose topic 0 probability is greater than 0.8
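To see the actual review texts behind those rows, a small follow-up sketch (not in the original post; it assumes the rows of document_topics are in the same order as the rows of yelp, which is how they were built above):

# Pull the original review texts for documents whose topic 0 probability exceeds 0.8.
strong_topic0_reviews = yelp.loc[top_topics, 'text']
print(strong_topic0_reviews.shape)
print(strong_topic0_reviews.head())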