Depending on how your data fits in memory, choose one of the following methods:
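If your data is currently in the memory of your local machine, you can use the distributed function to distribute an existing array from the client workspace to the workers of a parallel pool. This option can be useful for testing, or before performing operations that significantly increase the size of your arrays, such as repmat.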
If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use datastore with the distributed function to read data into the memory of the workers of a parallel pool.
If your data does not fit in the memory of your cluster, you can use datastore with tall arrays to partition and process your data in chunks. See also Big Data Workflow Using Tall Arrays and Datastores.
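The first option above can be sketched as follows. This is a minimal illustration rather than part of the original example: it assumes Parallel Computing Toolbox is installed and that a parallel pool is open or can be started automatically, and the array sizes are arbitrary.

```matlab
% Minimal sketch (assumed sizes): distribute a small client array first,
% then grow it with repmat so the large result is built on the workers.
A = magic(4);               % small array in the client workspace
D = distributed(A);         % distribute the existing array to the pool workers
B = repmat(D, 1000, 1);     % the 4000-by-4 result is a distributed array
disp(size(B))
disp(isdistributed(B))
```

Distributing before calling repmat means the enlarged array never has to exist in full in the client workspace.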
datastore
If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use datastore with the distributed function to create distributed arrays and partition the data among your workers.
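This example shows how to create and load distributed arrays using datastore. Create a datastore using a tabular file of airline flight data. This data set is too small to show equal partitioning of the data over the workers. To simulate a large data set, artificially increase the size of the datastore using repmat.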
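files = repmat({'airlinesmall.csv'},10,1);
ds = tabularTextDatastore(files);
Select the example variables.
ds.SelectedVariableNames = {'DepTime','DepDelay'};
ds.TreatAsMissing = 'NA';
Create a distributed table by reading the datastore in parallel. The datastore is partitioned with one partition per worker, and each worker then reads all the data from its corresponding partition. The files must be in a shared location that is accessible to the workers.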
dt = distributed(ds);
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Display summary information about the distributed table.
summary(dt)
Variables:

    DepTime: 1,235,230×1 double
        Values:

            min          1
            max       2505
            NaNs    23,510

    DepDelay: 1,235,230×1 double
        Values:

            min      -1036
            max       1438
            NaNs    23,510
Determine the size of the distributed table.
size(dt)
ans =

     1235230           2
Return the first few rows of dt.
head(dt)
ans =

    DepTime    DepDelay
    _______    ________

      642         12
     1021          1
     2055         20
     1332         12
      629         -1
     1446         63
      928         -2
      859         -1
     1833          3
     1041          1
Finally, check how much data each worker has loaded.
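spmd, dt, end
Lab 1:
  This worker stores dt2(1:370569,:).
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 2:
  This worker stores dt2(370570:617615,:).
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 3:
  This worker stores dt2(617616:988184,:).
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 4:
  This worker stores dt2(988185:1235230,:).
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]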
Note that the data is partitioned equally over the workers. For more details on datastore, see What Is a Datastore?
For more details about workflows for big data, see Choose a Parallel Computing Solution.
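If your data fits in the memory of your local machine, you can use distributed arrays to partition the data among your workers. Use the distributed function to create a distributed array in the MATLAB client and store its data on the workers of the open parallel pool. A distributed array is distributed in one dimension, and as evenly as possible along that dimension among the workers. You cannot control the details of distribution when creating a distributed array.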
You can create a distributed array in several ways:
Use the distributed function to distribute an existing array from the client workspace to the workers of a parallel pool.
Use any of the distributed functions to directly construct a distributed array on the workers. This technique does not require that the array already exists in the client, thereby reducing client workspace memory requirements. Functions include eye(___,'distributed') and rand(___,'distributed'). For a full list, see the distributed object reference page.
Create a codistributed array inside an spmd statement, and then access it as a distributed array outside the spmd statement. This technique lets you use distribution schemes other than the default.
The first two techniques do not involve spmd in creating the array, but you can use spmd to manipulate arrays created this way. For example:
Create an array in the client workspace, and then make it a distributed array.
parpool('local',2) % Create pool
W = ones(6,6);
W = distributed(W); % Distribute to the workers
spmd
    T = W*2; % Calculation performed on workers, in parallel.
             % T and W are both codistributed arrays here.
end
T % View results in client.
whos % T and W are both distributed arrays here.
delete(gcp) % Stop pool
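The second technique, constructing the array directly on the workers, might look like the sketch below. The size 1000 and the variable names are illustrative assumptions, and the code assumes a parallel pool is open or can be started automatically.

```matlab
% Minimal sketch (assumed size): build distributed arrays on the workers
% without first creating them in the client workspace.
R = rand(1000, 'distributed');   % 1000-by-1000 random matrix, built on the workers
I = eye(1000, 'distributed');    % 1000-by-1000 identity matrix, built on the workers
S = R + I;                       % arithmetic runs on the workers; S is distributed
c = gather(S(1,1));              % bring a single element back to the client
disp(c)
```

Only the gathered scalar c is transferred to the client; R, I, and S stay on the workers.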
Alternatively, you can use the codistributed function, which lets you control more options such as dimensions and partitions, but is often more complicated. You can create a codistributed array by executing on the workers themselves, either inside an spmd statement or inside a communicating job. When creating a codistributed array, you can control all aspects of distribution, including dimensions and partitions.
The relationship between distributed and codistributed arrays is one of perspective. Codistributed arrays are partitioned among the workers from which you execute code to create or manipulate them. When you create a distributed array in the client, you can access it as a codistributed array inside an spmd statement. When you create a codistributed array in an spmd statement, you can access it as a distributed array in the client. Only spmd statements let you access the same array data from two different perspectives.
You can create a codistributed array in several ways:
Use the codistributed function inside an spmd statement or a communicating job to codistribute data that already exists on the workers running that job.
Use any of the codistributed functions to directly construct a codistributed array on the workers. This technique does not require that the array already exists in the workers. Functions include eye(___,'codistributed') and rand(___,'codistributed'). For a full list, see the codistributed object reference page.
Create a distributed array outside an spmd statement, then access it as a codistributed array inside an spmd statement running on the same parallel pool.
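The first technique in this list can be sketched as follows. The 8-by-8 magic square and the variable names are assumptions chosen for illustration. labindex is used to match the rest of this page; newer MATLAB releases also offer spmdIndex. The code assumes a parallel pool is open or can be started automatically.

```matlab
% Minimal sketch (assumed data): codistribute an array that already
% exists, replicated, on every worker.
spmd
    A = magic(8);            % the same full array on each worker
    C = codistributed(A);    % distribute it with the default 1-D scheme
    L = getLocalPart(C);     % the portion stored on this worker
    fprintf('Worker %d holds a %d-by-%d block\n', labindex, size(L,1), size(L,2));
end
D = gather(C);               % outside spmd, C is a distributed array
```

Gathering C on the client reassembles the full 8-by-8 array from the parts held by the workers.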
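Create a codistributed array inside an spmd statement using a nondefault distribution scheme. First, define a 1-D distribution along the third dimension, with 4 parts on worker 1 and 12 parts on worker 2. Then create a 3-by-3-by-16 array of zeros.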
parpool('local',2) % Create pool
spmd
    codist = codistributor1d(3,[4,12]);
    Z = zeros(3,3,16,codist);
    Z = Z + labindex;
end
Z % View results in client.
  % Z is a distributed array here.
delete(gcp) % Stop pool
For more details on codistributed arrays, see Working with Codistributed Arrays.
codistributed | datastore | distributed | eye | rand | repmat | spmd | tall