
Work with Remote Data

You can read and write data from remote locations using MATLAB® functions and objects, such as file I/O functions and some datastore objects. These examples show how to set up, read from, and write to remote locations on the following cloud storage platforms:

  • Amazon S3™ (Simple Storage Service)

  • Azure® Blob Storage (previously known as Windows Azure® Storage Blob (WASB))

  • Hadoop® Distributed File System (HDFS™)

Amazon S3

MATLAB lets you use Amazon S3 as an online file storage web service offered by Amazon Web Services. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

s3://bucketname/path_to_file

bucketname is the name of the container and path_to_file is the path to the file or folders.

Amazon S3 provides data storage through web services interfaces. You can use a bucket as a container to store objects in Amazon S3.

Set Up Access

To work with remote data in Amazon S3, you must set up access first:

  1. Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.

  2. Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.

  3. Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.

  4. Configure your machine with your AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv:

    • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)

    • AWS_DEFAULT_REGION(optional) — Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.

    • AWS_SESSION_TOKEN (optional) — Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
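As a sketch, the variables above can be set directly from MATLAB with setenv. The key values below are placeholders, and the region name is only an example; substitute the credentials you generated in step 3.

```matlab
% Placeholder credentials -- substitute your own access key pair.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

% Optional: set the bucket region, and a session token when using
% temporary security credentials.
setenv('AWS_DEFAULT_REGION', 'us-east-1');              % example region
setenv('AWS_SESSION_TOKEN', 'YOUR_AWS_SESSION_TOKEN');  % temporary credentials only
```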

If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3 services. You can copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
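For example, a pool that forwards the AWS credential variables to its workers might be opened as follows; the 'local' profile name and pool size here are illustrative, not required values.

```matlab
% Copy the AWS credential variables from the client to each worker
% when the pool starts. Profile name and pool size are examples.
parpool('local', 4, 'EnvironmentVariables', ...
    {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_DEFAULT_REGION'});
```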

Read Data from Amazon S3

The following example shows how to use an ImageDatastore object to read a specified image from Amazon S3, and then display the image to screen.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)

Write Data to Amazon S3

The following example shows how to use a TabularTextDatastore object to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Amazon S3.

setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);

Azure Blob Storage

MATLAB lets you use Azure Blob Storage for online file storage. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form

wasbs://container@account/path_to_file/file.ext

container@account is the name of the container and path_to_file is the path to the file or folders.

Azure provides data storage through web services interfaces. You can use a blob to store data files in Azure. See Introduction to Azure for more information.

Set Up Access

To work with remote data in Azure storage, you must set up access first:

  1. Sign up for a Microsoft Azure account. See Microsoft Azure Account.

  2. Set up your authentication details by setting exactly one of the following two environment variables using setenv:

    • MW_WASB_SAS_TOKEN — Authentication via Shared Access Signature (SAS)

      Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.

      In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example,

      setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'

      You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.

    • MW_WASB_SECRET_KEY— Authentication via one of the Account's two secret keys

      Each Storage Account has two secret keys that permit administrative privilege access. This same access can be given to MATLAB without having to create an SAS token by setting the MW_WASB_SECRET_KEY environment variable. For example:

      setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'

If you are using Parallel Computing Toolbox, you must copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.

For more information, see Use Azure storage with Azure HDInsight clusters.

Read Data from Azure

To read data from an Azure Blob Storage location, specify the location using the following syntax:

wasbs://container@account/path_to_file/file.ext

container@account is the name of the container and path_to_file is the path to the file or folders.

For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using

location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});

You can use Azure for all calculations that datastores support, including direct reading, mapreduce, tall arrays, and deep learning. For example, create an ImageDatastore object, read a specified image from the datastore, and then display the image to screen.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)

Write Data to Azure

This example shows how to read tabular data from Azure into a tall array using a TabularTextDatastore object, preprocess it by removing missing entries and sorting, and then write it back to Azure.

setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);

Hadoop Distributed File System

Specify Location of Data

MATLAB lets you use Hadoop Distributed File System (HDFS) as an online file storage web service. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:

hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file

hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.

For example, you can use either of these commands to create a datastore for the file, file1.txt, in a folder named data located at a host named myserver:

  • ds = tabularTextDatastore('hdfs:///data/file1.txt')
  • ds = tabularTextDatastore('hdfs://myserver/data/file1.txt')

If hostname is specified, it must correspond to the namenode defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.

Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:

'hdfs://myserver:7867/data/file1.txt'

The port number specified must match the port number set in your HDFS configuration.

Set Hadoop Environment Variable

Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.

  • Hadoop v1 only — Set the HADOOP_HOME environment variable.

  • Hadoop v2 only — Set the HADOOP_PREFIX environment variable.

  • If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.

For example, use this command to set the HADOOP_HOME environment variable. Here, hadoop-folder is the folder where Hadoop is installed, and /mypath/ is the path to that folder.

setenv('HADOOP_HOME','/mypath/hadoop-folder');

HDFS Data on Hortonworks or Cloudera

If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when using Hortonworks or Cloudera application edge nodes.

Prevent Clearing Code from Memory

When reading from HDFS or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:

  • Clears the definitions of all Java® classes defined by files on the dynamic class path

  • Removes all global variables and variables from the base workspace

  • Removes all compiled scripts, functions, and MEX-functions from memory

To prevent persistent variables, code files, or MEX-files from being cleared, use the mlock function.
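For example, a function that creates a datastore can lock itself in memory so that the implicit javaaddpath call does not clear it. The function and variable names below are illustrative, not part of any MATLAB API.

```matlab
function ds = createRemoteDatastore(location)
%createRemoteDatastore  Create a datastore without this function file
%   or its persistent state being cleared when datastore calls javaaddpath.
    mlock                    % lock this function file and its persistent variables
    persistent numCalls      % survives the class-path change inside datastore
    if isempty(numCalls)
        numCalls = 0;
    end
    numCalls = numCalls + 1;
    ds = datastore(location);
end
```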

Write Data to HDFS

This example shows how to use a TabularTextDatastore object to write data to an HDFS location. Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.

ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/', tt);

To read your tall data back, use the datastore function.

ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);

See Also


Related Topics