
Distribute Arrays to Parallel Workers

Using Distributed Arrays to Partition Data Across Workers

Depending on how your data fits in memory, choose one of the following methods:

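  • If your data is currently in the memory of your local machine, you can use the distributed function to distribute an existing array from the client workspace to the workers of a parallel pool. This option is useful for testing, or before performing operations that significantly increase the size of your arrays, such as repmat.

  • If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use datastore with the distributed function to read data into the memory of the workers of a parallel pool.

  • If your data does not fit in the memory of your cluster, you can use datastore with tall arrays to partition and process your data in chunks. See also Big Data Workflow Using Tall Arrays and Datastores.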
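The third option can be sketched as follows. This is a minimal illustration, not part of the original example: it reuses the airlinesmall.csv sample file from the next section as a stand-in for a data set too large for cluster memory, and the variable names tt and meanDelay are chosen here for illustration only.

```matlab
% Sketch: chunked processing with a tall array backed by a datastore.
% airlinesmall.csv stands in for a much larger data set (assumption).
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
ds.SelectedVariableNames = {'DepDelay'};

tt = tall(ds);                               % tall table; nothing is read yet
delayExpr = mean(tt.DepDelay,'omitnan');     % deferred (unevaluated) result
meanDelay = gather(delayExpr);               % reads the data in chunks and evaluates
```

Operations on a tall array are deferred until gather is called, so only one chunk of the data needs to be in memory at a time.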

Load Distributed Arrays in Parallel Using datastore

If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use datastore with the distributed function to create distributed arrays and partition the data among your workers.

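This example shows how to create and load distributed arrays using datastore. Create a datastore using a tabular file of airline flight data. This data set is too small to show equal partitioning of the data over the workers. To simulate a large data set, artificially increase the size of the datastore using repmat.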

files = repmat({'airlinesmall.csv'},10,1);
ds = tabularTextDatastore(files);

Select the example variables.

ds.SelectedVariableNames = {'DepTime','DepDelay'};
ds.TreatAsMissing = 'NA';

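Create a distributed table by reading the datastore in parallel. The datastore is partitioned with one partition per worker, and each worker then reads all the data from its corresponding partition. The files must be in a shared location that is accessible to the workers.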

dt = distributed(ds);
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.

Display summary information about the distributed table.

summary(dt)
Variables:

    DepTime: 1,235,230×1 double
        Values:

            min          1
            max       2505
            NaNs    23,510

    DepDelay: 1,235,230×1 double
        Values:

            min      -1036
            max       1438
            NaNs    23,510

Determine the size of the distributed table.

size(dt)
ans =

     1235230           2

Return the first few rows of dt.

head(dt)
ans =

    DepTime    DepDelay
    _______    ________

      642         12
     1021          1
     2055         20
     1332         12
      629         -1
     1446         63
      928         -2
      859         -1
     1833          3
     1041          1

Finally, check how much data each worker has loaded.

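spmd, dt, end
Lab 1:
  This worker stores dt2(1:370569,:).
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 2:
  This worker stores dt2(370570:617615,:).
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 3:
  This worker stores dt2(617616:988184,:).
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 4:
  This worker stores dt2(988185:1235230,:).
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]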

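Note that the data is partitioned approximately evenly over the workers: two workers hold 370,569 rows each and two hold 247,046 rows each. For more details on datastore, see What Is a Datastore?

For more details about workflows for big data, see Choose a Parallel Computing Solution.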

Alternative Methods for Creating Distributed and Codistributed Arrays

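If your data fits in the memory of your local machine, you can use distributed arrays to partition the data among your workers. Use the distributed function to create a distributed array in the MATLAB client and store its data on the workers of the open parallel pool. A distributed array is distributed along one dimension, as evenly as possible among the workers. You cannot control the details of the distribution when you create a distributed array.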

You can create a distributed array in several ways:

  • Use the distributed function to distribute an existing array from the client workspace to the workers of a parallel pool.

  • Use any of the distributed functions to directly construct a distributed array on the workers. This technique does not require that the array already exists in the client, thereby reducing client workspace memory requirements. Functions include eye(___,'distributed') and rand(___,'distributed'). For a full list, see the distributed object reference page.

  • Create a codistributed array inside an spmd statement, and then access it as a distributed array outside the spmd statement. This technique lets you use distribution schemes other than the default.

The first two techniques do not involve spmd in creating the array, but you can use spmd to manipulate arrays created this way. For example:

Create an array in the client workspace, and then make it a distributed array.

parpool('local',2) % Create pool
W = ones(6,6);
W = distributed(W); % Distribute to the workers
spmd
    T = W*2; % Calculation performed on workers, in parallel.
             % T and W are both codistributed arrays here.
end
T % View results in client.
whos % T and W are both distributed arrays here.
delete(gcp) % Stop pool
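Constructing a distributed array directly on the workers, without first creating it in the client, might look like the following sketch. The array sizes, the pool size of 2, and the variable names are assumptions made for illustration.

```matlab
parpool(2);                          % assumes the default profile can start 2 workers
R = rand(1000,1000,'distributed');   % built on the workers, never stored in the client
E = eye(1000,'distributed');         % distributed identity matrix

tf = isdistributed(R);               % true: R is a distributed array
colMeans = gather(mean(R));          % compute on the workers, bring back a 1-by-1000 result
total = gather(sum(sum(E)));         % sum of all elements of the identity matrix
delete(gcp)                          % stop pool
```

Only the small results returned by gather occupy client memory; the full arrays stay on the workers.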

Alternatively, you can use the codistributed function, which lets you control more options, such as dimensions and partitions, but is often more complicated. You create a codistributed array by executing on the workers themselves, either inside an spmd statement or inside a communicating job. When creating a codistributed array, you can control all aspects of distribution, including dimensions and partitions.

The relationship between distributed and codistributed arrays is one of perspective. Codistributed arrays are partitioned among the workers from which you execute code to create or manipulate them. When you create a distributed array in the client, you can access it as a codistributed array inside an spmd statement. When you create a codistributed array in an spmd statement, you can access it as a distributed array in the client. Only spmd statements let you access the same array data from two different perspectives.
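The two perspectives can be observed by checking the class of the same array in the client and inside an spmd statement. This is a small sketch; the pool size and variable names are assumptions.

```matlab
parpool(2);                       % assumes the default profile can start 2 workers
D = distributed(magic(4));        % created from the client
onClient = class(D);              % 'distributed' in the client

spmd
    onWorker = class(D);          % the same data is seen as 'codistributed' here
end
onWorker1 = onWorker{1};          % read worker 1's value from the Composite
delete(gcp)                       % stop pool
```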

You can create a codistributed array in several ways:

  • Use the codistributed function inside an spmd statement or a communicating job to codistribute data that already exists on the workers running that job.

  • Use any of the codistributed functions to directly construct a codistributed array on the workers. This technique does not require that the array already exists in the workers. Functions include eye(___,'codistributed') and rand(___,'codistributed'). For a full list, see the codistributed object reference page.

  • Create a distributed array outside an spmd statement, and then access it as a codistributed array inside an spmd statement running on the same parallel pool.
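The first technique in this list, codistributing data that already exists on every worker, can be sketched as follows. The 8-by-8 array and the pool size of 2 are assumptions chosen so that the default partition is easy to read.

```matlab
parpool(2);                       % assumes the default profile can start 2 workers
spmd
    A = magic(8);                 % the same full array exists on every worker
    C = codistributed(A);         % default scheme: 1-D, split along the columns
    dist = getCodistributor(C);   % inspect how C is distributed
    part = dist.Partition;        % number of columns held by each worker
end
p1 = part{1};                     % read worker 1's copy from the Composite
delete(gcp)                       % stop pool
```

With two workers, the eight columns are split evenly, so the partition is [4 4].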

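Create a codistributed array inside an spmd statement using a nondefault distribution scheme. First, define a 1-D distribution along the third dimension, with 4 parts on worker 1 and 12 parts on worker 2. Then create a 3-by-3-by-16 array of zeros.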

parpool('local',2) % Create pool
spmd
    codist = codistributor1d(3,[4,12]);
    Z = zeros(3,3,16,codist);
    Z = Z + labindex;
end
Z % View results in client.
  % Z is a distributed array here.
delete(gcp) % Stop pool
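To confirm that the nondefault partition took effect, you can check the size of the portion each worker holds. This sketch uses getLocalPart and assumes a pool of exactly 2 workers, because the partition [4,12] has two entries.

```matlab
parpool(2);                               % partition [4,12] requires exactly 2 workers
spmd
    codist = codistributor1d(3,[4,12]);   % distribute along dimension 3
    Z = zeros(3,3,16,codist);
    localSize = size(getLocalPart(Z));    % size of this worker's portion of Z
end
size1 = localSize{1};                     % portion on worker 1
size2 = localSize{2};                     % portion on worker 2
delete(gcp)                               % stop pool
```

Worker 1 holds a 3-by-3-by-4 portion and worker 2 holds a 3-by-3-by-12 portion, matching the partition vector.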

For more details on codistributed arrays, see Working with Codistributed Arrays.
