You can read and write data from a remote location using MATLAB® functions and objects, such as file I/O functions and some datastore objects. These examples show how to set up, read from, and write to remote locations on the following cloud storage platforms:
Amazon S3™ (Simple Storage Service)
Azure® Blob Storage (previously known as Windows Azure® Storage Blob (WASB))
Hadoop®Distributed File System (HDFS™)
MATLAB lets you use Amazon S3 as an online file storage web service offered by Amazon Web Services. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form
s3://bucketname/path_to_file
bucketname is the name of the container and path_to_file is the path to the file or folders.
Amazon S3 provides data storage through web services interfaces. You can use a bucket as a container to store objects in Amazon S3.
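Once access is set up as described below, datastore and file I/O functions accept s3:// URLs directly. A minimal sketch, assuming a hypothetical bucket named mybucket that contains a CSV file:

```matlab
% Hypothetical bucket and file path -- substitute your own.
% readtable accepts remote s3:// URLs once your AWS credentials are configured.
T = readtable('s3://mybucket/datasets/sample.csv');
```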
To work with remote data in Amazon S3, you must set up access first:
Sign up for an Amazon Web Services (AWS) root account. See Amazon Web Services: Account.
Using your AWS root account, create an IAM (Identity and Access Management) user. See Creating an IAM User in Your AWS Account.
Generate an access key to receive an access key ID and a secret access key. See Managing Access Keys for IAM Users.
Configure your machine with your AWS access key ID, secret access key, and region using the AWS Command Line Interface tool from https://aws.amazon.com/cli/. Alternatively, set the environment variables directly by using setenv:
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY — Authenticate and enable use of Amazon S3 services. (You generated this pair of access key variables in step 3.)
AWS_DEFAULT_REGION (optional) — Select the geographic region of your bucket. The value of this environment variable is typically determined automatically, but the bucket owner might require that you set it manually.
AWS_SESSION_TOKEN (optional) — Specify the session token if you are using temporary security credentials, such as with AWS® Federated Authentication.
If you are using Parallel Computing Toolbox™, you must ensure the cluster has been configured to access S3 services. You can copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
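The step above can be sketched as follows. This is a minimal example, assuming a cluster profile named 'myCluster' (a hypothetical name; substitute your own profile):

```matlab
% Copy the AWS credential environment variables from the client
% to the workers when opening the pool. 'myCluster' is hypothetical.
parpool('myCluster', 'EnvironmentVariables', ...
    {'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_DEFAULT_REGION'});
```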
The following example shows how to use an ImageDatastore object to read a specified image from Amazon S3, and then display the image to screen.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = imageDatastore('s3://bucketname/image_datastore/jpegfiles', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
The following example shows how to use a tabularTextDatastore object to read tabular data from Amazon S3 into a tall array, preprocess it by removing missing entries and sorting, and then write it back to Amazon S3.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY');

ds = tabularTextDatastore('s3://bucketname/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('s3://bucketname/preprocessedData/', tt);
To read your tall data back, use the datastore function.
ds = datastore('s3://bucketname/preprocessedData/');
tt = tall(ds);
MATLAB lets you use Azure Blob Storage for online file storage. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of the form
wasbs://container@account/path_to_file/file.ext
container@account is the name of the container and path_to_file is the path to the file or folders.
Azure provides data storage through web services interfaces. You can use a blob to store data files in Azure. See Introduction to Azure for more information.
To work with remote data in Azure storage, you must set up access first:
Sign up for a Microsoft Azure account. See Microsoft Azure Account.
Set up your authentication details by setting exactly one of the following two environment variables using setenv:
MW_WASB_SAS_TOKEN — Authentication via Shared Access Signature (SAS)
Obtain an SAS. For details, see the "Get the SAS for a blob container" section in Manage Azure Blob Storage resources with Storage Explorer.
In MATLAB, set MW_WASB_SAS_TOKEN to the SAS query string. For example:
setenv MW_WASB_SAS_TOKEN '?st=2017-04-11T09%3A45%3A00Z&se=2017-05-12T09%3A45%3A00Z&sp=rl&sv=2015-12-11&sr=c&sig=E12eH4cRCLilp3Tw%2BArdYYR8RruMW45WBXhWpMzSRCE%3D'
You must set this string to a valid SAS token generated from the Azure Storage web UI or Explorer.
MW_WASB_SECRET_KEY — Authentication via one of the account's two secret keys
Each storage account has two secret keys that permit administrative privilege access. This same access can be given to MATLAB without having to create an SAS token by setting the MW_WASB_SECRET_KEY environment variable. For example:
setenv MW_WASB_SECRET_KEY '1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF'
If you are using Parallel Computing Toolbox, you must copy your client environment variables to the workers on a cluster by setting EnvironmentVariables in parpool, batch, createJob, or in the Cluster Profile Manager.
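As a sketch of the batch route, you can forward the Azure SAS token to the workers running a job. The script name below is hypothetical; substitute your own:

```matlab
% Forward the client's Azure SAS token to the workers of a batch job.
% 'processBlobData' is a hypothetical script name.
job = batch('processBlobData', 'EnvironmentVariables', {'MW_WASB_SAS_TOKEN'});
```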
For more information, see Use Azure storage with Azure HDInsight clusters.
To read data from an Azure Blob Storage location, specify the location using the following syntax:
wasbs://container@account/path_to_file/file.ext
container@account is the name of the container and path_to_file is the path to the file or folders.
For example, if you have a file airlinesmall.csv in a folder /airline on a test storage account wasbs://blobContainer@storageAccount.blob.core.windows.net/, then you can create a datastore by using
location = 'wasbs://blobContainer@storageAccount.blob.core.windows.net/airline/airlinesmall.csv';
ds = tabularTextDatastore(location, 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});
You can use Azure for all calculations that datastores support, including direct reads, MapReduce, tall arrays, and deep learning. For example, create an ImageDatastore object, read a specified image from the datastore, and then display the image to screen.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = imageDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
img = ds.readimage(1);
imshow(img)
This example shows how to read tabular data from Azure into a tall array using a tabularTextDatastore object, preprocess it by removing missing entries and sorting, and then write it back to Azure.
setenv('MW_WASB_SAS_TOKEN', 'YOUR_WASB_SAS_TOKEN');

ds = tabularTextDatastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/', tt);
To read your tall data back, use the datastore function.
ds = datastore('wasbs://YourContainer@YourAccount.blob.core.windows.net/preprocessedData/');
tt = tall(ds);
MATLAB lets you use Hadoop Distributed File System (HDFS) as an online file storage web service. When you specify the location of the data, you must specify the full path to the files or folders using a uniform resource locator (URL) of one of these forms:
hdfs:/path_to_file
hdfs:///path_to_file
hdfs://hostname/path_to_file
hostname is the name of the host or server and path_to_file is the path to the file or folders. Specifying the hostname is optional. When you do not specify the hostname, Hadoop uses the default host name associated with the Hadoop Distributed File System (HDFS) installation in MATLAB.
For example, you can use either of these commands to create a datastore for the file file1.txt in a folder named data, located at a host named myserver:
ds = tabularTextDatastore('hdfs:///data/file1.txt')
ds = tabularTextDatastore('hdfs://myserver/data/file1.txt')
If hostname is specified, it must correspond to the namenode defined by the fs.default.name property in the Hadoop XML configuration files for your Hadoop cluster.
Optionally, you can include the port number. For example, this location specifies a host named myserver with port 7867, containing the file file1.txt in a folder named data:
'hdfs://myserver:7867/data/file1.txt'
The specified port number must match the port number set in your HDFS configuration.
Before reading from HDFS, use the setenv function to set the appropriate environment variable to the folder where Hadoop is installed. This folder must be accessible from the current machine.
Hadoop v1 only — Set the HADOOP_HOME environment variable.
Hadoop v2 only — Set the HADOOP_PREFIX environment variable.
If you work with both Hadoop v1 and Hadoop v2, or if the HADOOP_HOME and HADOOP_PREFIX environment variables are not set, then set the MATLAB_HADOOP_INSTALL environment variable.
For example, use this command to set the HADOOP_HOME environment variable. hadoop-folder is the folder where Hadoop is installed, and /mypath/ is the path to that folder.
setenv('HADOOP_HOME','/mypath/hadoop-folder');
If your current machine has access to HDFS data on Hortonworks or Cloudera®, then you do not have to set the HADOOP_HOME or HADOOP_PREFIX environment variables. MATLAB automatically assigns these environment variables when you use Hortonworks or Cloudera application edge nodes.
When reading from HDFS or when reading Sequence files locally, the datastore function calls the javaaddpath command. This command does the following:
Clears the definitions of all Java® classes defined by files on the dynamic class path
Removes all global variables and variables from the base workspace
Removes all compiled scripts, functions, and MEX-functions from memory
To prevent persistent variables, code files, or MEX-files from being cleared, use the mlock function.
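For instance, a function that caches state in a persistent variable can lock itself so the clearing triggered by javaaddpath does not discard that state. A minimal sketch, with a hypothetical function name:

```matlab
function c = cachedCount()
% Hypothetical example: mlock keeps this function file, and therefore
% its persistent variable, in memory even when javaaddpath clears
% other functions and variables.
mlock
persistent n
if isempty(n)
    n = 0;
end
n = n + 1;
c = n;
end
```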
This example shows how to use a tabularTextDatastore object to write data to an HDFS location. Use the write function to write your tall and distributed arrays to a Hadoop Distributed File System. When you call this function on a distributed or tall array, you must specify the full path to an HDFS folder. The following example shows how to read tabular data from HDFS into a tall array, preprocess it by removing missing entries and sorting, and then write it back to HDFS.
ds = tabularTextDatastore('hdfs://myserver/some/path/dataset/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', 'SelectedVariableNames', {'ArrDelay'});
tt = tall(ds);
tt = sortrows(rmmissing(tt));
write('hdfs://myserver/some/path/preprocessedData/', tt);
To read your tall data back, use the datastore function.
ds = datastore('hdfs://myserver/some/path/preprocessedData/');
tt = tall(ds);
datastore | imageDatastore | imread | imshow | javaaddpath | mlock | setenv | tabularTextDatastore | write