Introduction to Machine Learning, Part 2: Unsupervised Machine Learning

Seth DeLand, MathWorks

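Get an overview of unsupervised machine learning, which looks for patterns in datasets that don’t have labeled responses. You’d use this technique when you want to explore your data but don’t yet have a specific goal, or when you’re not sure what information the data contains. It’s also a good way to reduce the dimensionality of your data.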

Most unsupervised learning techniques are a form of cluster analysis. Clustering algorithms fall into two broad groups:

Hard clustering, where each data point belongs to only one cluster
Soft clustering, where each data point can belong to more than one cluster

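Using examples, this video explains hard and soft clustering algorithms, and it shows why you’d want to use unsupervised machine learning to reduce the number of features in your dataset.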

Unsupervised machine learning looks for patterns in datasets that don’t have labeled responses.

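You’d use this technique when you want to explore your data but don’t yet have a specific goal, or when you’re not sure what information the data contains.

It’s also a good way to reduce the dimensionality of your data.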

As we’ve previously discussed, most unsupervised learning techniques are a form of cluster analysis, which separates data into groups based on shared characteristics.

Clustering algorithms fall into two broad groups:

Hard clustering, where each data point belongs to only one cluster
Soft clustering, where each data point can belong to more than one cluster

For context, here’s a hard clustering example:

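Say you’re an engineer building cell phone towers. You need to decide where to build the towers, and how many. To make sure you’re providing the best signal reception, you need to locate the towers within clusters of people.

To start, you need an initial guess at the number of clusters. To do this, compare scenarios with three towers and four towers to see how well each is able to provide service.

Because a phone can only talk to one tower at a time, this is a hard clustering problem.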

For this, you could use k-means clustering, because the k-means algorithm treats each observation in the data as an object having a location in space. It finds cluster centers, or means, that reduce the total distance from data points to their cluster centers.
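To make the tower example concrete, here is a minimal sketch of k-means in plain Python that compares three and four cluster centers. The phone locations are made up for illustration, the distance is squared Euclidean (the usual k-means objective), and the farthest-point starting guess is an assumption chosen only so the result is reproducible; none of this is taken from the video itself.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def farthest_point_init(points, k):
    """Deterministic start (an assumption of this sketch): take the first
    point, then repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Hard assignment: each point gets the index of its single nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm. Returns (centers, labels, total squared distance)."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iters):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the center to the mean of the points assigned to it.
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, cost


# Hypothetical phone locations: four small towns of five phones each.
towns = [(0, 0), (10, 0), (0, 10), (10, 10)]
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
phones = [(tx + dx, ty + dy) for tx, ty in towns for dx, dy in offsets]

for k in (3, 4):
    centers, labels, cost = kmeans(phones, k)
    print(k, "towers: total squared distance =", round(cost, 2))
```

In this toy layout the four-tower scenario gives a much smaller total distance than the three-tower one, which is the kind of comparison the example describes. With real data you would typically try several starting guesses, since k-means can settle in a local optimum.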

So, that was hard clustering. Let’s see how you might use a soft clustering algorithm in the real world.

Pretend you’re a biologist analyzing the genes involved in normal and abnormal cell division. You have data from two tissue samples, and you want to compare them to determine whether certain patterns of gene features correlate to cancer.

Because the same genes can be involved in several biological processes, no single gene is likely to belong to one cluster only.

Apply a fuzzy c-means algorithm to the data, and then visualize the clusters to see which groups of genes behave in similar ways.
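As a rough illustration of soft clustering, the sketch below implements the standard fuzzy c-means updates in plain Python. The "gene" values are invented two-number profiles, the fuzzifier m = 2 and the starting centers are assumptions of this sketch, and the function names are hypothetical rather than any toolbox API.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def membership_row(p, centers, m):
    """Degrees of membership of point p in every cluster; they sum to 1."""
    d2 = [sq_dist(p, c) for c in centers]
    on_center = [i for i, d in enumerate(d2) if d == 0.0]
    if on_center:
        # The point sits exactly on a center: give it full membership there.
        return [1.0 / len(on_center) if i in on_center else 0.0
                for i in range(len(centers))]
    inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(points, centers, m=2.0, max_iters=200, tol=1e-12):
    """Alternate membership and center updates until the centers settle.
    Returns (centers, memberships)."""
    dim = len(points[0])
    for _ in range(max_iters):
        memberships = [membership_row(p, centers, m) for p in points]
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            # Each center is the membership-weighted mean of all points.
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)))
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    memberships = [membership_row(p, centers, m) for p in points]
    return centers, memberships


# Hypothetical gene profiles: (expression in sample A, expression in sample B).
genes = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),   # one group of similar genes
         (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),   # a second group
         (3.0, 3.0)]                           # a gene that sits in between

centers, memberships = fuzzy_c_means(genes, [genes[0], genes[3]])
for gene, row in zip(genes, memberships):
    print(gene, [round(u, 2) for u in row])
```

Unlike the hard assignment in k-means, every gene here receives a degree of membership in each cluster, so the in-between gene ends up sharing its membership across both groups instead of being forced into one.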

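You can then use this model to help see which features correlate with normal or abnormal cell division.

This covers the two main techniques (hard and soft clustering) for exploring data with unlabeled responses.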

Remember though, that you can also use unsupervised machine learning to reduce the number of features, or the dimensionality, of your data.

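You’d do this to make your data less complex, especially if you’re working with data that has hundreds or thousands of variables. By reducing the complexity of your data, you’re able to focus on the important features and gain better insights.

Let’s take a look at 3 common dimensionality reduction algorithms: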

Principal Component Analysis (PCA) performs a linear transformation on the data so that most of the variance in your dataset is captured by the first few principal components. This could be useful for developing condition indicators for machine health monitoring.
Factor Analysis identifies underlying correlations between variables in your dataset. It provides a representation of unobserved latent, or common, factors. Factor analysis is sometimes used to explain stock price variation.

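Nonnegative matrix factorization is used when model terms must represent nonnegative quantities, such as physical quantities. If you need to compare a lot of text on webpages or documents, this would be a good method to start with, as text is either not present, or occurs a positive number of times.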
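To give a feel for the first of these techniques, here is a minimal PCA sketch in plain Python. It finds only the first principal component, using power iteration on the covariance matrix, and reports how much of the variance that component captures. The sensor readings are invented, the function names are hypothetical, and a real analysis would normally rely on a tested library routine and often standardize the variables first.

```python
import math


def covariance_matrix(rows):
    """Return (column means, sample covariance matrix) for a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return means, cov


def first_principal_component(cov, iters=500):
    """Power iteration: returns (unit direction, variance along it).
    Assumes the starting vector is not orthogonal to the leading direction."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance at all in the data
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance


def project(rows, means, v):
    """Collapse each row to a single score along direction v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]


# Hypothetical machine readings: (vibration, temperature rise, motor current).
readings = [(1.0, 2.1, 0.9), (1.5, 2.9, 1.6), (2.1, 4.2, 2.0),
            (2.4, 5.1, 2.6), (3.0, 5.8, 3.1), (3.6, 7.2, 3.4)]

means, cov = covariance_matrix(readings)
direction, variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
print("share of variance in first component:", round(variance / total_variance, 3))
print("one-number indicator per reading:",
      [round(s, 2) for s in project(readings, means, direction)])
```

The projected scores replace three correlated measurements with a single value per reading, which is the sense in which a principal component could serve as a condition indicator.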

In this video, we took a closer look at hard and soft clustering algorithms, and we also showed why you’d want to use unsupervised machine learning to reduce the number of features in your dataset.

As for your next steps:

Unsupervised learning might be your end goal. If you’re just looking to segment data, a clustering algorithm is an appropriate choice.

On the other hand, you might want to use unsupervised learning as a dimensionality reduction step for supervised learning. In our next video we’ll take a closer look at supervised learning.

For now, that wraps up this video. Don’t forget to check out the description below for more resources and links.

Product Focus

Statistics and Machine Learning Toolbox

Series: Introduction to Machine Learning

Part 1: Machine Learning Fundamentals. Explore the fundamentals behind machine learning, focusing on unsupervised and supervised learning. Learn about the common techniques, including clustering, classification, and regression.