Practical Introductory Tutorial on Industrial Robots
Following on from my earlier post on Data Science, here I will try to summarize and compile the major practical concepts of Machine Learning in a handy, easy-to-use, language-agnostic reference guide. Most of the information is presented as succinct bullet points. I expect this to be especially valuable to beginners, or as a quick look-up for those with a basic level of experience in data science and machine learning.
Introductory Concepts
Let us get some basic terminology out of the way first:
Structured data refers to data stored in a predefined format, e.g., tables, spreadsheets or relational databases
Unstructured data, on the other hand, does not have a predefined format and therefore cannot readily be saved in tabular form. Unstructured data comes in a variety of types, e.g., blobs of text, images, videos, audio files
Categorical data is any data that can be labeled and usually consists of a fixed set of values, e.g., gender, nationality, risk grades. Categorical data can be either nominal (without any inherent ordering, e.g., gender) or ordinal (ordered or ranked data, e.g., risk grades). These fixed values are known as classes or categories
Features or Predictors: input data/variables used by an ML model, usually denoted with X, to predict the target variable
Target variable: the quantity that we want an ML model to predict, often represented with y
Classification problem involves predicting a discrete class of a categorical target variable, e.g., spam or not, default or non-default
Regression problem deals with predicting a continuous numeric value, e.g., sales, house price
Feature Engineering: transforming existing features or engineering new input features that can potentially be more useful during model training. E.g., calculating the number of months from today for a date variable
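As a minimal sketch of the date example above, the following Python snippet derives a "months elapsed" feature from a date. The function name `months_since` and the sample dates are hypothetical. The reference date is passed in explicitly instead of being read from the clock so that the result is reproducible, and counting whole calendar months is only one of several reasonable conventions.

```python
from datetime import date

def months_since(event_date: date, reference_date: date) -> int:
    """Whole calendar months between event_date and reference_date."""
    months = (reference_date.year - event_date.year) * 12
    months += reference_date.month - event_date.month
    # Do not count the final month if its day has not been reached yet.
    if reference_date.day < event_date.day:
        months -= 1
    return months

# Hypothetical raw feature: the date on which a loan was opened.
opened_on = date(2020, 1, 15)
as_of = date(2020, 4, 10)
months_open = months_since(opened_on, as_of)  # engineered numeric feature
```

The engineered `months_open` value is a plain number that a model can use directly, which is usually easier than feeding it a raw date.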
Training, Validation & Test data: Training data is used during initial model training/fitting. Validation data is used to evaluate the model during development, usually to fine-tune model hyperparameters or to select the most suitable ML model among several candidates. Test data is used only for the final evaluation of a short-listed or fine-tuned model
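One simple way to obtain the three sets is a shuffled split, sketched below in plain Python. The helper name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not a recommendation from this guide.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into train / validation / test lists."""
    shuffled = list(rows)                  # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

rows = list(range(100))                    # stand-in for 100 labeled examples
train, val, test = train_val_test_split(rows)
```

Every example ends up in exactly one of the three lists, so the test set stays untouched until the final evaluation.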
Overfitting happens when a model performs well on the training data but poorly on the test/validation data, i.e., it fails to generalize adequately to new and unseen data
Underfitting occurs when the model is not complex enough to learn the relationships between variables from the training data, so it has low accuracy even on the data on which it was trained
Model bias and variance: A model is said to have high bias when it performs poorly even on the training dataset, typically as a result of underfitting. Variance reflects how sensitive the model is to the particular training data it saw; a high-variance model performs noticeably worse on the test/validation set than on the training set, which is usually caused by overfitting
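To make underfitting and overfitting concrete, here is a small sketch on synthetic data. Everything in it is an assumption chosen for illustration: the linear-plus-noise data, the alternating train/validation split, and the two deliberately extreme models (a constant predictor and a 1-nearest-neighbor predictor).

```python
import random

def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: a linear trend plus Gaussian noise (seeded for repeatability).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

# Alternate points between the training and validation sets.
x_train, y_train = xs[0::2], ys[0::2]
x_val, y_val = xs[1::2], ys[1::2]

# Underfit: always predict the training mean, ignoring x entirely.
mean_y = sum(y_train) / len(y_train)

def mean_model(x):
    return mean_y

# Overfit: 1-nearest-neighbor memorizes every training point, noise included.
def one_nn_model(x):
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

for name, model in [("mean", mean_model), ("1-NN", one_nn_model)]:
    print(name,
          "train MSE:", round(mse(model, x_train, y_train), 2),
          "validation MSE:", round(mse(model, x_val, y_val), 2))
```

The constant model has a large error on both sets (high bias). The 1-NN model has zero training error because it reproduces each training label exactly, yet its validation error is non-zero; that gap between training and validation error is the usual symptom of high variance.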
Generalization, closely related to overfitting and model variance, refers to a model’s ability to make correct predictions on new, previously unseen data
Regularization techniques improve the generalizability of a model, e.g., through penalizing or shrinking regression coefficients towards zero
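The shrinking effect can be seen in the simplest possible case: a one-feature linear model without an intercept and with an L2 (ridge) penalty. Minimizing `sum((y - w*x)**2) + lam * w**2` gives the closed form `w = sum(x*y) / (sum(x*x) + lam)`. The function name `ridge_slope`, the penalty values, and the toy data below are assumptions for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)**2) + lam * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # exactly y = 2x

for lam in (0.0, 10.0, 30.0):
    # Prints slopes 2.0, 1.5 and 1.0: a larger penalty shrinks w towards zero.
    print(lam, ridge_slope(xs, ys, lam))
```

With no penalty the fit recovers the true slope of 2; as the penalty grows, the coefficient is pulled towards zero, trading some fit on the training data for a simpler model.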
Ensemble Learning is a modeling technique that combines multiple models into one, typically by aggregating their individual predictions
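One simple way to combine classifiers is majority voting, sketched below. The three "models" are hypothetical hand-written rules on an email's word count, standing in for trained classifiers, and voting is only one of several ways to build an ensemble.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers, each a simple threshold rule.
def model_a(n_words):
    return "spam" if n_words < 50 else "ham"

def model_b(n_words):
    return "spam" if n_words < 30 else "ham"

def model_c(n_words):
    return "spam" if n_words < 80 else "ham"

def ensemble(n_words):
    """Combine the three models into one by majority vote."""
    votes = [m(n_words) for m in (model_a, model_b, model_c)]
    return majority_vote(votes)
```

For an input of 40 words, two of the three rules say "spam", so the ensemble does too, even though `model_b` disagrees. Using an odd number of voters on a two-class problem avoids ties.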
Baseline Model is a naive model/heuristic used as a reference point to evaluate a conventional ML model
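For a classification problem, a common naive baseline is to always predict the most frequent class seen in training. The sketch below assumes hypothetical default/non-default labels; the helper names `majority_class_baseline` and `accuracy` are made up for this example.

```python
from collections import Counter

def majority_class_baseline(y_train):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return lambda _features: most_common_label

def accuracy(predict, X, y):
    """Fraction of examples for which the prediction matches the label."""
    return sum(predict(x) == label for x, label in zip(X, y)) / len(y)

y_train = ["non-default"] * 8 + ["default"] * 2
X_test = [None] * 5   # the baseline ignores the features entirely
y_test = ["non-default", "non-default", "default", "non-default", "non-default"]

baseline = majority_class_baseline(y_train)
baseline_accuracy = accuracy(baseline, X_test, y_test)   # 4 of 5 correct
```

Any ML model trained on the same data should beat this number to justify its added complexity, which matters most on imbalanced datasets where the naive accuracy is already high.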
Hyperparameters are the model settings that are chosen before training rather than learned from the data, and that can be tweaked to improve model performance