
Reproducibility in Parallel Statistical Computations

Issues and Considerations in Reproducing Parallel Computations

A reproducible computation is one that gives the same results every time it runs. Reproducibility is important for:

  • Debugging - To correct an anomalous result, you need to reproduce the result.

  • Confidence — When you can reproduce results, you can investigate and understand them.

  • Modifying existing code — When you change existing code, you want to ensure that you do not break anything.

Generally, you do not need to ensure reproducibility for your computation. Often, when you want reproducibility, the simplest technique is to run in serial instead of in parallel. In serial computation you can simply call the rng function as follows:

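    s = rng    % Obtain the current state of the random stream
    % run the statistical function
    rng(s)     % Reset the stream to the previous state
    % run the statistical function again, obtaining the same results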
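As a concrete check of the serial pattern, here is a minimal sketch in which rand stands in for the statistical function. The stand-in and the variable names a and b are assumptions for illustration only.

```matlab
s = rng;              % save the current generator settings
a = rand(1,5);        % stand-in for "run the statistical function"
rng(s);               % restore the saved settings
b = rand(1,5);        % rerun the same stand-in computation
disp(isequal(a,b))    % compares the two runs element by element
```

Any serial function that draws from the global stream can take the place of rand here.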

This section addresses the case when your function uses random numbers, and you want reproducible results in parallel. This section also addresses the case when you want the same results in parallel as in serial.

Running Reproducible Parallel Computations

To run a Statistics and Machine Learning Toolbox™ function reproducibly:

  1. Set the UseSubstreams option to true using statset.

  2. Set the Streams option to a type that supports substreams: 'mlfg6331_64' or 'mrg32k3a'. For information on these streams, see RandStream.list.

  3. To compute in parallel, set the UseParallel option to true.

  4. To fit an ensemble in parallel using fitcensemble or fitrensemble, create the tree learners with the 'Reproducible' name-value pair set to true:

    t = templateTree('Reproducible',true);
    ens = fitcensemble(X,Y,'Method','bag','Learners',t,...
        'Options',options);
  5. Call the function with the options structure.

  6. To reproduce the computation, reset the stream, then call the function again.

To understand why this technique gives reproducibility, see How Substreams Enable Reproducible Parallel Computations.

For example, to use the 'mlfg6331_64' stream for reproducible computation:

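  1. Create an appropriate options structure:

    s = RandStream('mlfg6331_64');
    options = statset('UseParallel',true,...
        'Streams',s,'UseSubstreams',true);
  2. Run your parallel computation. For instructions, see Quick Start Parallel Computing for Statistics and Machine Learning Toolbox.

  3. Reset the random stream: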

    reset(s);
  4. Rerun your parallel computation. You obtain identical results.

For examples of parallel computations run in this reproducible way, see Reproducible Parallel Bootstrap and Train Classification Ensemble in Parallel.

Parallel Statistical Computation Using Random Numbers

What Are Substreams?

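A substream is a portion of a random stream that RandStream can access quickly. There is a number M such that for any positive integer k, RandStream can go to the kMth pseudorandom number in the stream. From that point, RandStream can generate the subsequent entries in the stream. Currently, RandStream has M = 2^72, about 5e21, or more.

The entries in different substreams have good statistical properties, similar to the properties of entries in a single stream: independence, and lack of k-way correlation at various lags. The substreams are so long that you can view the substreams as being independent streams, as in the following picture.

Two RandStream stream types support substreams: 'mlfg6331_64' and 'mrg32k3a'.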
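The following minimal sketch illustrates the jump-ahead behavior described above by setting the Substream property of a RandStream object. The substream indices 3 and 7 and the variable names are arbitrary choices for illustration.

```matlab
s = RandStream('mlfg6331_64');   % generator type that supports substreams
s.Substream = 3;                 % position the stream at the start of substream 3
x1 = rand(s,1,5);                % first five values of substream 3
s.Substream = 7;                 % jump to a different substream
y  = rand(s,1,5);                % values drawn from substream 7
s.Substream = 3;                 % return to the start of substream 3
x2 = rand(s,1,5);                % the same five values as x1
disp(isequal(x1,x2))
```

Because setting the property repositions the stream at the beginning of the chosen substream, the draws in between do not affect what x2 contains.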

How Substreams Enable Reproducible Parallel Computations

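When MATLAB® performs computations in parallel with parfor, each worker receives loop iterations in an unpredictable order. Therefore, you cannot predict which worker gets which iteration, so you cannot determine the random numbers associated with each iteration.

Substreams allow MATLAB to tie each iteration to a particular sequence of random numbers. parfor gives each iteration an index. The iteration uses the index as the substream number. Since the random numbers are associated with the iterations, not with the workers, the entire computation is reproducible.

To obtain reproducible results, simply reset the stream, and all the substreams generate identical random numbers when called again. This method succeeds when all the workers use the same stream, and the stream supports substreams. This concludes the discussion of how the procedure in Running Reproducible Parallel Computations gives reproducible parallel results.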
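The idea can be sketched without a parallel pool. The example below is a serial simulation, not the toolbox's internal implementation: it runs the same "iterations" in two different orders and uses each iteration index as the substream number. The permutation and the variable names are assumptions for illustration.

```matlab
s = RandStream('mlfg6331_64');
n = 6;

% Pass 1: iterations in natural order
forward = zeros(n,1);
for i = 1:n
    s.Substream = i;          % tie iteration i to substream i
    forward(i) = rand(s);
end

% Pass 2: same iterations in a scrambled order, as workers might receive them
shuffled = zeros(n,1);
for i = [4 1 6 3 5 2]
    s.Substream = i;          % same index, therefore same substream
    shuffled(i) = rand(s);
end

disp(isequal(forward,shuffled))   % compares results iteration by iteration
```

Each iteration draws from the substream selected by its index, so the value stored for iteration i does not depend on when that iteration runs.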

Random Numbers on the Client or Workers

A few functions generate random numbers on the client before distributing them to parallel workers. The workers do not use random numbers, so operate purely deterministically. For these functions, you can run a parallel computation reproducibly using any random stream type.

The functions that operate this way include:

To obtain identical results, reset the random stream on the client, or the random stream you pass to the client. For example:

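    s = rng    % Obtain the current state of the random stream
    % run the statistical function
    rng(s)     % Reset the stream to the previous state
    % run the statistical function again, obtaining the same results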

While this method enables you to run reproducibly in parallel, the results can differ from a serial computation. The reason for the difference is that parfor loops run in reverse order from for loops. Therefore, a serial computation can generate random numbers in a different order than a parallel computation. For unequivocal reproducibility, use the technique in Running Reproducible Parallel Computations.

Distributing Streams Explicitly

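For testing or comparison using a particular random number algorithm, you must set the random number generators. How do you set these generators in parallel, or initialize the stream on each worker in a particular way? Or you might want to run a computation using a different sequence of random numbers than any other you have run. How can you ensure that the sequence you use is statistically independent?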

Parallel Statistics and Machine Learning Toolbox functions allow you to set random streams on each worker explicitly. For information on creating multiple streams, enter help RandStream/create at the command line. To create four independent streams using the 'mrg32k3a' generator:

    s = RandStream.create('mrg32k3a','NumStreams',4,...
        'CellOutput',true);

Pass these streams to a statistical function using the Streams option. For example:

    parpool(4) % if you have at least 4 cores
    s = RandStream.create('mrg32k3a','NumStreams',4,...
        'CellOutput',true); % create 4 independent streams
    paroptions = statset('UseParallel',true,...
        'Streams',s); % set the 4 different streams
    x = [randn(700,1); 4 + 2*randn(300,1)];
    latt = -4:0.01:12;
    myfun = @(X) ksdensity(X,latt);
    pdfestimate = myfun(x);
    B = bootstrp(200,myfun,x,'Options',paroptions);

This method of distributing streams gives each worker a different stream for the computation. However, it does not allow for a reproducible computation, because the workers perform the 200 bootstraps in an unpredictable order. If you want to perform a reproducible computation, use substreams as described in Running Reproducible Parallel Computations.

If you set the UseSubstreams option to true, then set the Streams option to a single random stream of a type that supports substreams ('mlfg6331_64' or 'mrg32k3a'). This setting gives reproducible computations.
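As a sketch of that setting, the bootstrap computation can be rerun with a single substream-capable stream and then compared against a second run. This assumes that Parallel Computing Toolbox is available and that a pool is open or can start automatically. The names opts, B1, and B2 are illustrative.

```matlab
s = RandStream('mlfg6331_64');               % single stream that supports substreams
opts = statset('UseParallel',true,...
    'Streams',s,'UseSubstreams',true);       % one stream, substreams enabled

x = [randn(700,1); 4 + 2*randn(300,1)];      % sample data, fixed for both runs
latt = -4:0.01:12;
myfun = @(X) ksdensity(X,latt);

B1 = bootstrp(200,myfun,x,'Options',opts);   % first run
reset(s);                                    % return the stream to its initial state
B2 = bootstrp(200,myfun,x,'Options',opts);   % second run with the same substreams

disp(isequal(B1,B2))                         % check whether the two runs match
```

Unlike the four-stream setup, this configuration ties the random numbers to the bootstrap iterations rather than to the workers, which is what makes the rerun comparable.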