Technical Articles and Newsletters

Parallel MATLAB: Multiple Processors and Multiple Cores

By Cleve Moler, MathWorks



Twelve years ago, in the Spring of 1995, I wrote a Cleve's Corner titled "Why There Isn't a Parallel MATLAB." That one-page article has become one of my most frequently referenced papers. At the time, I argued that the distributed memory model of most contemporary parallel computers was incompatible with the MATLAB memory model, that MATLAB spent only a small portion of its execution time on tasks that could be automatically parallelized, and that there were not enough potential customers to justify a significant development effort.

The situation is very different today. First, MATLAB has evolved from a simple "Matrix Laboratory" into a mature technical computing environment that supports large-scale projects involving much more than numerical linear algebra. Second, today's microprocessors have two or four computational cores (we can expect even more in the future), and modern computers have sophisticated, hierarchical memory structures. Third, most MATLAB users now have access to clusters and networks of machines, and will soon have personal parallel computers.

As a result of all these changes, we now have Parallel MATLAB.

MATLAB supports three types of parallelism: multithreaded, distributed computing, and explicit parallelism. These types can coexist. For example, a distributed computing job might invoke multithreaded functions on each machine and then use distributed arrays to collect the final results. For multithreaded parallelism, the number of threads can be set in the MATLAB Preferences panel. We use the Intel Math Kernel Library, which includes multithreaded versions of the BLAS (Basic Linear Algebra Subroutines). For vector arguments, the MATLAB elementary function library, which includes exponential and trigonometric functions, is multithreaded.
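As a minimal sketch, the multithreaded pieces can be exercised like this (maxNumCompThreads is assumed to be available, as in R2007a and later; in earlier releases the thread count is set only in the Preferences panel):

```matlab
% Sketch: exercising the multithreaded BLAS and elementary functions.
maxNumCompThreads(4);      % request four computational threads

A = rand(2000);            % large dense matrix
B = A * A;                 % matrix multiply calls the multithreaded BLAS
E = exp(A);                % elementary functions are multithreaded
                           % for large vector (and matrix) arguments
s = sum(A(:));             % summing the elements can also use threads
```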

Choosing which form of parallelism to use can be complicated. This Cleve's Corner describes some experiments that combine multithreading and distributed computing.

Three Types of Parallel Computing


Multithreaded parallelism. In multithreaded parallelism, one instance of MATLAB automatically generates multiple simultaneous instruction streams. Multiple processors or cores, sharing the memory of a single computer, execute these streams. An example is summing the elements of a matrix.

Distributed computing. In distributed computing, multiple instances of MATLAB run multiple independent computations on separate computers, each with its own memory. Years ago I dubbed this very common and important kind of parallelism "embarrassingly parallel" because no new computer science is involved. In most cases, a single program is run many times with different parameters or with different random number seeds.
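For illustration, such a job might be set up with the toolbox's job and task functions. This is only a sketch: the scheduler type, the seed range, and the function myMonteCarlo are all assumptions for the example.

```matlab
% Sketch of an embarrassingly parallel job: the same program is run
% many times with different random number seeds.
jm  = findResource('scheduler', 'type', 'jobmanager');  % locate a job manager
job = createJob(jm);

for seed = 1:16
    % myMonteCarlo is a hypothetical function that takes a seed
    % and returns one output argument.
    createTask(job, @myMonteCarlo, 1, {seed});
end

submit(job);
waitForState(job, 'finished');
results = getAllOutputArguments(job);   % one row of outputs per task
destroy(job);
```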

Explicit parallelism. In explicit parallelism, several instances of MATLAB run on multiple processors or computers, often with separate memories, and simultaneously execute a single MATLAB command or M-function. New programming constructs, including parallel loops and distributed arrays, describe the parallelism.
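A parallel loop is the simplest of these constructs. As a hedged sketch, using the parfor syntax of later releases (the details of the loop construct evolved across toolbox versions):

```matlab
% Sketch: a parallel loop whose independent iterations are spread
% across the labs. Each iteration must not depend on the others.
n = 64;
y = zeros(1, n);
parfor i = 1:n
    y(i) = max(abs(eig(rand(200))));   % an arbitrary independent task
end
```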

A Parallel Computing Cluster

Figure 1 is a schematic of a typical parallel computing cluster. The gray boxes are separate computers, each with its own chassis, power supply, disc drive, network connections, and memory. The light blue boxes are microprocessors. The darker blue boxes within each microprocessor are the computational cores. The green boxes are primary memory. There are several different memory models. In some designs, each core has uniform access to the entire memory. In others, memory access times are not uniform, and our green memory boxes could be split into two or four smaller pieces attached to each processor or core.

Figure 1. A typical parallel computing cluster.

At The MathWorks, we have several such clusters, running both Linux and Windows operating systems. One has 16 dual-processor, dual-core computers. Each machine has two AMD Opteron 285 processors, and each processor has two cores. Each computer also has four gigabytes of memory, which is shared by the four cores on that machine. We therefore have up to 64 separate computational streams but only 16 primary memories. We call these clusters the HPC lab, for High Performance or High Productivity Computing, even though they are certainly not supercomputers.

Our HPC lab has one important advantage over a top supercomputer: one person can take over the entire machine for interactive use. But this is a luxury. When several people share a parallel computing facility, they usually must submit jobs to a queue, to be processed as time and space on the machine permit. Large-scale interactive computing is rare.

The first version of Distributed Computing Toolbox, released in 2005, provided the capability of managing multiple, independent MATLAB jobs in such an environment. The second version, released in 2006, added a MATLAB binding of MPI, the industry standard for communication between jobs. Since MATLAB stands for "Matrix Laboratory," we decided to call each instance of MATLAB a "lab" and introduced numlabs, the number of labs involved in a job.
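Inside a parallel job, each lab can query its own index and the total number of labs, and the MPI binding provides functions such as labSend and labReceive for communication between labs. A minimal sketch, to be run inside a parallel job or pmode session:

```matlab
% Sketch: each lab computes a value; lab 1 gathers and sums them.
x = labindex^2;                 % labindex is this lab's number, 1..numlabs
if labindex == 1
    total = x;
    for src = 2:numlabs
        total = total + labReceive(src);   % receive from each other lab
    end
    disp(total)
else
    labSend(x, 1);              % send this lab's value to lab 1
end
```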

Beginning with version 3.0 of Distributed Computing Toolbox, The MathWorks added support for new programming constructs that take MATLAB beyond the embarrassingly parallel, multiple jobs style of computing involved in this first benchmark.

Using Multithreaded Math Libraries Within Multiple MATLAB Tasks

The starting point for this experiment is bench.m, the source code for the bench example. I removed the graphics tasks and all the report-generating code, leaving just four computational tasks:

  • ODE—Use ODE45 to solve the van der Pol ordinary differential equation over a long time interval
  • FFT—Use FFTW to compute the Fourier transform of a vector of length 2^20
  • LU—Use LAPACK to factor a 1000-by-1000 real dense matrix
  • Sparse—Use A\b to solve a sparse linear system of order 66,603 with 331,823 nonzeros
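The four tasks can be sketched in plain MATLAB along these lines. This is a reconstruction from the descriptions above, not the actual bench.m code; the van der Pol parameter, the time interval, and the stand-in sparse matrix are assumptions.

```matlab
% ODE: van der Pol equation over a long time interval (mu = 10 assumed)
vdp = @(t, y) [y(2); 10*(1 - y(1)^2)*y(2) - y(1)];
[t, y] = ode45(vdp, [0 500], [2; 0]);

% FFT: Fourier transform of a vector of length 2^20 (calls FFTW)
x = rand(2^20, 1);
f = fft(x);

% LU: LAPACK factorization of a 1000-by-1000 real dense matrix
A = rand(1000);
[L, U, P] = lu(A);

% Sparse: solve a sparse system with A\b; gallery('wathen',...) is a
% stand-in matrix, not the order-66,603 matrix used in the article
S = gallery('wathen', 50, 50);
b = ones(size(S, 1), 1);
u = S \ b;
```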

I used Distributed Computing Toolbox and MATLAB Distributed Computing Engine to run multiple copies of this stripped-down bench. The computation is embarrassingly parallel: once started, there is no communication between the tasks until the execution time results are collected at the end of the run. The computation is also multithreaded.

The plots in Figures 2 through 5 show the ratio of the execution time for one singly threaded task on one lab to the execution time of many multithreaded tasks on many labs. The horizontal line represents perfect efficiency: p tasks can be done on p labs in the same time that it takes to do one task on one lab. The first and most important point is that for all four tasks, the blue curve is indistinguishable from the horizontal line, up to 16 labs. This indicates that single-threaded tasks get perfect efficiency if there is at most one task per computer. This is hardly surprising. If it weren't true for these embarrassingly parallel tasks, it would be a sign of a serious bug in our hardware, software, or measurements. With more than 16 labs, or more than one thread per lab, however, the performance is more complicated. To see why, we need to look at each computation separately.

The ODE task (Figure 2) is typical of many MATLAB tasks: it involves interpreted M-code, repeated function calls, a modest amount of data at each step of the ODE solver, and many steps. With this task we get perfect efficiency, even up to 64 labs. Each core can handle all the computation for one lab. Memory requirements do not dominate. Using multiple threads has no effect because the task does not access any multithreaded library. In fact, it is not clear how multithreading could be exploited here.

Figure 2. Execution speed ratio for the ODE task.

The FFT task (Figure 3) involves a vector of length n = 2^20. The n log n complexity implies that there are only a few arithmetic operations on each vector element. The task shows almost perfect efficiency up to 16 labs, but beyond 16 the ratio deteriorates because multiple cores cannot get data from memory fast enough. It takes about 40% more time for 64 labs to complete 64 tasks than for one lab to complete one task. Again, multiple threads have no effect because we do not use a multithreaded FFT library.

Figure 3. Execution speed ratio for the FFT task.

The blue line in the LU plot (Figure 4) shows that with just one thread per lab we get good but not quite perfect efficiency, up to 64 labs. The time it takes 64 labs to complete 64 tasks is only about 6% more than the time it takes one lab to do one task. The matrix order is n = 1000. The n^2 storage is about the same as the FFT task, but the n^3 complexity implies that each element is reused many times. The underlying LAPACK factorization algorithm makes effective use of cache, so the fact that there are only 16 primary memories does not adversely affect the computation time.

Figure 4. Execution speed ratio for the LU task and different levels of multithreading.

The green and red lines in the LU plot show that using two or four threads per lab is an advantage as long as the number of threads times the number of labs does not exceed the number of cores. With this restriction, two threads per lab run about 20% faster than one thread, and four threads per lab run about 60% faster than one thread. These percentages would be larger with larger matrices and smaller with smaller matrices. With more than 64 threads (that is, more than 32 labs using two threads per lab, or more than 16 labs using four threads per lab), the multithreading becomes counterproductive.

Results for the sparse task (Figure 5) are the least typical but perhaps the most surprising. The blue line again shows that with just one thread per lab, we obtain good efficiency. However, the red and green lines show that multithreading is always counterproductive—at least for this particular matrix. CHOLMOD, a supernodal sparse Cholesky linear equation solver developed by Tim Davis, was introduced into MATLAB recently, but before we were concerned about multithreading. The algorithm switches from sparse data structures, which do not use the BLAS, to dense data structures and the multithreaded BLAS when the number of floating-point operations per nonzero matrix element exceeds a specified threshold. The supernodal algorithm involves operations on highly rectangular matrices. It appears that the BLAS do not handle such matrices well. We need to reexamine both CHOLMOD and the BLAS to make them work more effectively together.

Figure 5. Execution speed ratio for the sparse task and different levels of multithreading.

Published 2007 - 91467v00
