Process Big Data in the Cloud

This example shows how to access a large data set in the cloud and process it in a cloud cluster using MATLAB capabilities for big data.

Learn how to:

  • Access a public data set on Amazon Cloud.

  • Find and select an interesting subset of this data set.

  • Process this subset in less than 20 minutes using datastores, tall arrays, and Parallel Computing Toolbox.

The public data set in this example is part of the Wind Integration National Dataset Toolkit, or WIND Toolkit [1], [2], [3], [4]. For more information, see Wind Integration National Dataset Toolkit.

Requirements

To run this example, you must set up access to a cluster in Amazon AWS. In MATLAB, you can create clusters in Amazon AWS directly from the MATLAB desktop. On the Home tab, in the Parallel menu, select Create and Manage Clusters. In the Cluster Profile Manager, click Create Cloud Cluster. Alternatively, you can use MathWorks Cloud Center to create and access compute clusters in Amazon AWS. For more information, see Getting Started with Cloud Center.

Set Up Access to Remote Data

The data set used in this example is the Techno-Economic WIND Toolkit. It contains 2 TB (terabyte) of data for wind power estimates and forecasts along with atmospheric variables from 2007 to 2013 within the continental U.S.

The Techno-Economic WIND Toolkit is available through Amazon Web Services, in the location s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data. It contains two data sets:

  • s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data - Metrology Data

  • s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/fcst_data - Forecast Data

To work with remote data in Amazon S3, you must define environment variables for your AWS credentials. For more information on setting up access to remote data, see Work with Remote Data. In the following code, replace YOUR_AWS_ACCESS_KEY_ID and YOUR_AWS_SECRET_ACCESS_KEY with your own Amazon AWS credentials. If you are using temporary AWS security credentials, also set the environment variable AWS_SESSION_TOKEN.

setenv("AWS_ACCESS_KEY_ID","YOUR_AWS_ACCESS_KEY_ID");
setenv("AWS_SECRET_ACCESS_KEY","YOUR_AWS_SECRET_ACCESS_KEY");

This data set requires you to specify its geographic region, and so you must set the corresponding environment variable.

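setenv("AWS_DEFAULT_REGION","us-west-2");

To give the workers in your cluster access to the remote data, add these environment variable names to the EnvironmentVariables property of your cluster profile. To edit the properties of your cluster profile, use the Cluster Profile Manager, in Parallel > Create and Manage Clusters. For more information, see Set Environment Variables on Workers.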

Find Subset of Big Data

The 2 TB data set is quite large. This example shows you how to find a subset of the data set that you want to analyze. The example focuses on data for the state of Massachusetts.

First, obtain the IDs that identify the metrological stations in Massachusetts, and determine the files that contain their metrological information. Metadata information for each station is in a file named three_tier_site_metadata.csv. Because this data is small and fits in memory, you can access it from the MATLAB client with readtable. You can use the readtable function to access open data in S3 buckets directly, without writing special code.

tMetadata = readtable("s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/three_tier_site_metadata.csv", ...
    "ReadVariableNames",true,"TextType","string");

To find out which states are listed in this data set, use unique.

states = unique(tMetadata.state)
states = 50×1 string array
    "Alabama" "Arizona" "Arkansas" "California" "Colorado" "Connecticut"
    "District of Columbia" "Florida" "Georgia" "Idaho" "Illinois" "Indiana"
    "Iowa" "Kansas" "Kentucky" "Louisiana" "Maryland" "Massachusetts"
    "Michigan" "Minnesota" "Mississippi" "Montana" "Nebraska" "New Hampshire"
    "New Jersey" "New Mexico" "New York" "North Carolina" "North Dakota" "Ohio"
    "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island" "South Carolina" "South Dakota"
    "Tennessee" "Texas" "Utah" "Vermont" "Virginia" "Washington"
    "West Virginia" "Wisconsin" "Wyoming"

Identify which stations are located in the state of Massachusetts.

index = tMetadata.state == "Massachusetts";
siteId = tMetadata{index,"site_id"};

The data for a given station is contained in a file that follows this naming convention: s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data/folder/site_id.nc, where folder is the nearest integer less than or equal to site_id/500. Using this convention, compose a file location for each station.

folder = floor(siteId/500);
fileLocations = compose("s3://nrel-pds-wtk/wtk-techno-economic/pywtk-data/met_data/%d/%d.nc",folder,siteId);
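
As a quick check of the naming convention, the following minimal sketch applies the same floor and compose calls to a few site IDs. The ID values are hypothetical and chosen only to show where the folder number changes; they are not taken from the data set, and only the path relative to the met_data location is built.

```matlab
% Hypothetical site IDs, chosen only to illustrate the convention
exampleIds = [0; 499; 500; 1234];

% folder is the largest integer less than or equal to site_id/500
exampleFolders = floor(exampleIds/500);

% Build the path of each file relative to the met_data location
examplePaths = compose("%d/%d.nc",exampleFolders,exampleIds);
disp(examplePaths)
```

With these inputs, IDs 0 and 499 fall in folder 0, ID 500 in folder 1, and ID 1234 in folder 2.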

Process Big Data

You can use datastores and tall arrays to access and process data that does not fit in memory. When performing big data computations, MATLAB accesses smaller portions of the remote data as needed, so you do not need to download the entire data set at once. With tall arrays, MATLAB automatically breaks the data into smaller blocks that fit in memory for processing.

If you have Parallel Computing Toolbox, MATLAB can process the many blocks in parallel. The parallelization enables you to run an analysis on a single desktop with local workers, or scale up to a cluster for more resources. When you use a cluster in the same cloud service as the data, the data stays in the cloud and you benefit from improved data transfer times. Keeping the data in the cloud is also more cost-effective. This example ran in less than 20 minutes using 18 workers on a c4.8xlarge machine in Amazon AWS.

If you use a parallel pool in a cluster, MATLAB processes this data using workers in the cluster. Create a parallel pool in the cluster. In the following code, replace myAWSCluster with the name of your own cluster profile. Attach the script to the pool, because the parallel workers need to access a helper function in it.

p = parpool("myAWSCluster");
Starting parallel pool (parpool) using the 'myAWSCluster' profile ...
connected to 18 workers.
addAttachedFiles(p,mfilename("fullpath"));

Create a datastore with the metrology data for the stations in Massachusetts. The data is in the form of Network Common Data Form (NetCDF) files, and you must use a custom read function to interpret them. In this example, this function is named ncReader and reads the NetCDF data into timetables. You can explore its contents at the end of this script.

dsMetrology = fileDatastore(fileLocations,"ReadFcn",@ncReader,"UniformRead",true);

Create a tall timetable with the metrology data from the datastore.

ttMetrology = tall(dsMetrology)
ttMetrology =

  M×6 tall timetable

            Time            wind_speed    wind_direction    power     density    temperature    pressure
    ____________________    __________    ______________    ______    _______    ___________    ________

    01-Jan-2007 00:00:00      5.905           189.35        3.3254    1.2374       269.74        97963
    01-Jan-2007 00:05:00      5.8898          188.77        3.2988    1.2376       269.73        97959
    01-Jan-2007 00:10:00      5.9447          187.85        3.396     1.2376       269.71        97960
    01-Jan-2007 00:15:00      6.0362          187.05        3.5574    1.2376       269.68        97961
    01-Jan-2007 00:20:00      6.1156          186.49        3.6973    1.2375       269.83        97958
    01-Jan-2007 00:25:00      6.2133          185.71        3.8698    1.2376       270.03        97952
    01-Jan-2007 00:30:00      6.3232          184.29        4.0812    1.2379       270.19        97955
    01-Jan-2007 00:35:00      6.4331          182.51        4.3382    1.2382       270.3         97957
             :                  :               :             :          :           :             :
             :                  :               :             :          :           :             :

Get the mean temperature per month using groupsummary, and sort the resulting tall table. For performance, MATLAB defers most tall operations until the data is needed. In this case, plotting the data triggers evaluation of deferred calculations.

meanTemperature = groupsummary(ttMetrology,"Time","month","mean","temperature");
meanTemperature = sortrows(meanTemperature);

Plot the results.

figure;
plot(meanTemperature.mean_temperature,"*-");
ylim([260 300]);
xlim([1 12*7+1]);
xticks(1:12:12*7+1);
xticklabels(["2007","2008","2009","2010","2011","2012","2013","2014"]);
title("Average Temperature in Massachusetts 2007-2013");
xlabel("Year");
ylabel("Temperature (K)")
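
The deferred evaluation described above can also be seen on a very small scale. The following is a minimal sketch, not part of the original example: it uses a tall array built from a few invented in-memory values and calls mapreducer(0) so that the calculation runs in the local MATLAB session instead of a parallel pool. The variable names are illustrative only.

```matlab
% Evaluate tall arrays in the local MATLAB session rather than a parallel pool
mapreducer(0);

% Small in-memory values, invented purely for illustration
x = tall([270; 272; 274; 276]);

% This line only queues the calculation; m is still an unevaluated tall array
m = mean(x);

% gather triggers the deferred calculation and returns an in-memory result
result = gather(m);
disp(result)
```

Until gather is called, no data is read. In the example above, plotting plays the same role as gather for the monthly mean temperatures.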

Many MATLAB functions support tall arrays, so you can perform a variety of calculations on big data sets using familiar syntax. For more information on supported functions, see Supported Functions.

Define Custom Read Function

The data in the Techno-Economic WIND Toolkit is saved in NetCDF files. Define a custom read function to read its data into a timetable. For more information on reading NetCDF files, see NetCDF Files.

function t = ncReader(filename)
% NCREADER Read NetCDF file (.nc), extract the data set, and save it as a timetable

% Get information about NetCDF data source
fileInfo = ncinfo(filename);

% Extract variable names and datatypes
varNames = string({fileInfo.Variables.Name});
varTypes = string({fileInfo.Variables.Datatype});

% Transform variable names into valid names for table variables
if any(startsWith(varNames,["4","6"]))
    strVarNames = replace(varNames,["4","6"],["four","six"]);
else
    strVarNames = varNames;
end

% Extract the length of each variable
fileLength = fileInfo.Dimensions.Length;

% Extract initial timestamp, sample period and create the time axis
tAttributes = struct2table(fileInfo.Attributes);
startTime = datetime(cell2mat(tAttributes.Value(contains(tAttributes.Name,"start_time"))),"ConvertFrom","epochtime");
samplePeriod = seconds(cell2mat(tAttributes.Value(contains(tAttributes.Name,"sample_period"))));

% Create the output timetable
numVars = numel(strVarNames);
tableSize = [fileLength numVars];
t = timetable('Size',tableSize,'VariableTypes',varTypes,'VariableNames',strVarNames,'TimeStep',samplePeriod,'StartTime',startTime);

% Fill in the timetable with variable data
for k = 1:numVars
    t(:,k) = table(ncread(filename,varNames{k}));
end
end

References

[1] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. Overview and Meteorological Validation of the Wind Integration National Dataset Toolkit (Technical Report, NREL/TP-5000-61740). Golden, CO: National Renewable Energy Laboratory, 2015.

[2] Draxl, C., B. M. Hodge, A. Clifton, and J. McCaa. "The Wind Integration National Dataset (WIND) Toolkit." Applied Energy. Vol. 151, 2015, pp. 355-366.

[3] King, J., A. Clifton, and B. M. Hodge. Validation of Power Output for the WIND Toolkit (Technical Report, NREL/TP-5D00-61714). Golden, CO: National Renewable Energy Laboratory, 2014.

[4] Lieberman-Cribbin, W., C. Draxl, and A. Clifton. Guide to Using the WIND Toolkit Validation Code (Technical Report, NREL/TP-5000-62595). Golden, CO: National Renewable Energy Laboratory, 2014.
