splitlabels

Find indices to split labels according to specified proportions

Description

Use this function when you are working on a machine or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.

idxs = splitlabels(lblsrc,p) finds logical indices that split the labels in lblsrc based on the proportions or number of labels specified in p.

idxs = splitlabels(lblsrc,p,'randomized') randomly assigns the specified proportion of label values to each index set in idxs.

idxs = splitlabels(___,Name,Value) specifies additional input arguments using name-value pairs. For example, 'UnderlyingDatastoreIndex',3 splits the labels only in the third underlying datastore of a combined datastore.
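As a minimal sketch of the 'randomized' syntax, the following uses a small synthetic label vector. The variable names and label values are illustrative assumptions and do not come from the examples on this page. Seeding the random number generator with rng before the call makes the random assignment repeatable between runs.

```matlab
% Illustrative labels: 8 of category "A" and 12 of category "B".
lbls = [repmat("A",8,1); repmat("B",12,1)];

% Seed the generator so the randomized split is repeatable.
rng('default')

% Split each category in half, assigning labels to the two sets at random.
idxs = splitlabels(lbls,0.5,'randomized');

% Count the labels that each index set selects.
cntFirst = countlabels(lbls(idxs{1}))
cntSecond = countlabels(lbls(idxs{2}))
```

Based on the description above, each set is expected to hold half of every category, so the ratio between "A" and "B" should be the same in both sets even though the particular elements chosen are random.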

Examples

Read William Shakespeare's sonnets with the fileread function. Extract all the vowels from the text and convert them to lowercase.

sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';

Count the number of instances of each vowel.

cnts = countlabels(vowels)
cnts=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4940     18.368
      e      9028     33.569
      i      4895     18.201
      o      5710     21.232
      u      2321     8.6302

Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.

spltn = splitlabels(vowels,[500 300]);

for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end

cntsn{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       500       20
      e       500       20
      i       500       20
      o       500       20
      u       500       20

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       300       20
      e       300       20
      i       300       20
      o       300       20
      u       300       20

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      4140     18.083
      e      8228      35.94
      i      4095     17.887
      o      4910     21.447
      u      1521     6.6437

Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. All vowels are represented with the same weight across all three sets.

spltp = splitlabels(vowels,[0.5 0.3]);

for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end

cntsp{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      2470     18.367
      e      4514     33.566
      i      2448     18.203
      o      2855      21.23
      u      1161     8.6333

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a      1482     18.371
      e      2708     33.569
      i      1468     18.198
      o      1713     21.235
      u       696     8.6277

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______

      a       988     18.368
      e      1806     33.575
      i       979       18.2
      o      1142     21.231
      u       464     8.6261

Read William Shakespeare's sonnets with the fileread function. Remove all nonalphabetic characters from the text and convert to lowercase.

sonnets = fileread("sonnets.txt");
letters = lower(sonnets(regexp(sonnets,"[A-z]")))';

Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.

type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";
T = table(letters,type,'VariableNames',["Letter" "Type"]);

head(T)
ans=8×2 table
    Letter       Type
    ______    ___________

      t       "consonant"
      h       "consonant"
      e       "vowel"
      s       "consonant"
      o       "vowel"
      n       "consonant"
      n       "consonant"
      e       "vowel"

Display the number of instances of each category.

cnt = countlabels(T,'TableVariable',"Type")
cnt=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    46516    63.365
    vowel        26894    36.635

Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.

splt = splitlabels(T,0.6,'TableVariable',"Type");

sixty = countlabels(T(splt{1},:),'TableVariable',"Type")
sixty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    27910    63.366
    vowel        16136    36.634

forty = countlabels(T(splt{2},:),'TableVariable',"Type")
forty=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    18606    63.363
    vowel        10758    36.637

Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Exclude the letter y, which sometimes acts as a consonant and sometimes as a vowel. Display the number of instances of each category.

splt = splitlabels(T,0.6,'Exclude',"y");

sixti = countlabels(T(splt{1},:),'TableVariable',"Type")
sixti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    26719    62.346
    vowel        16137    37.654

forti = countlabels(T(splt{2},:),'TableVariable',"Type")
forti=2×3 table
      Type       Count    Percent
    _________    _____    _______

    consonant    17813    62.349
    vowel        10757    37.651

Split the table into two sets of the same size. Include only the letters e and s. Randomize the sets.

halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);

cnt = countlabels(T(halves{1},:))
cnt=2×3 table
    Letter    Count    Percent
    ______    _____    _______

      e       4514     64.385
      s       2497     35.615

Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as A, 30 as B, and 30 as C. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels.

dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);

cnt = countlabels(dsDataset,'UnderlyingDatastoreIndex',2)
cnt=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       40        40
      B       30        30
      C       30        30

Split the data set into two sets, one containing 60% of the numbers and the other with the rest.

splitIndices = splitlabels(dsDataset,0.6,'UnderlyingDatastoreIndex',2);

dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,'UnderlyingDatastoreIndex',2)
cnt1=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       24        40
      B       18        30
      C       18        30

dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,'UnderlyingDatastoreIndex',2)
cnt2=3×3 table
    Label    Count    Percent
    _____    _____    _______

      A       16        40
      B       12        30
      C       12        30

Input Arguments

Input label source, specified as one of these:

  • A categorical vector.

  • A string vector or a cell array of character vectors.

  • A numeric vector or a cell array of numeric scalars.

  • A logical vector or a cell array of logical scalars.

  • A table with variables containing any of the previous data types.

  • A datastore whose readall function returns any of the previous data types.

  • A CombinedDatastore object containing an underlying datastore whose readall function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.

lblsrc must contain labels that can be converted to a vector with a discrete set of categories.

Example: lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"]) creates the label source as a ten-sample categorical vector with four categories: A, B, C, and D.

Example: lblsrc = [0 7 2 5 11 17 15 7 7 11] creates the label source as a ten-sample numeric vector.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string | table | cell | categorical

Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.

  • If p is a scalar, splitlabels finds two splitting index sets and returns a two-element cell array in idxs.

    • If p is an integer, the first element of idxs contains a vector of indices pointing to the first p values of each label category. The second element of idxs contains indices pointing to the remaining values of each label category.

    • If p is a value in the range (0, 1) and lblsrc has K_i elements in the i-th category, the first element of idxs contains a vector of indices pointing to the first p × K_i values of each label category. The second element of idxs contains the indices of the remaining values of each label category.

  • If p is a vector with N elements of the form p1, p2, …, pN, splitlabels finds N + 1 splitting index sets and returns an (N + 1)-element cell array in idxs.

    • If p is a vector of integers, the first element of idxs is a vector of indices pointing to the first p1 values of each label category, the next element of idxs contains the next p2 values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.

    • If p is a vector of fractions and lblsrc has K_i elements of the i-th category, the first element of idxs is a vector of indices concatenating the first p1 × K_i values of each category, the next element of idxs contains the next p2 × K_i values of each label category, and so on. The last element in idxs contains the remaining indices of each label category.
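To make the vector-of-fractions case concrete, here is a minimal sketch with a small synthetic label vector (the label values and variable names are illustrative assumptions). The fractions 0.5 and 0.25 are chosen so that every per-category product is a whole number, which sidesteps any question about how fractional counts are rounded.

```matlab
% Illustrative labels: 8 of category "A" and 12 of category "B".
lbls = [repmat("A",8,1); repmat("B",12,1)];

% Two fractions request three index sets: 50%, 25%, and the remaining 25%.
idxs = splitlabels(lbls,[0.5 0.25]);

% Number of labels that each index set selects.
setSizes = cellfun(@(ix) numel(lbls(ix)),idxs)
```

Following the rules above, the first set should select 4 "A" and 6 "B" labels, the second 2 "A" and 3 "B", and the last set the remaining 2 "A" and 3 "B", so setSizes should contain 10, 5, and 5.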

Note

  • If p contains fractions, then the sum of its elements must not be greater than one.

  • If p contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
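The constraint on label counts can be checked before calling splitlabels. The sketch below is one possible way to do that, using countlabels to find the smallest category. The label vector and the variable names maxTotal and setSizes are illustrative assumptions.

```matlab
% Illustrative labels: 6 of category "A" and 4 of category "B".
lbls = [repmat("A",6,1); repmat("B",4,1)];

% The smallest category limits the total count that p can request.
cnt = countlabels(lbls);
maxTotal = min(cnt.Count);

% Request 2 labels per category in the first set and 1 in the second.
p = [2 1];
assert(sum(p) <= maxTotal,"p requests more labels than the smallest category holds.")

idxs = splitlabels(lbls,p);

% Number of labels that each index set selects.
setSizes = cellfun(@(ix) numel(lbls(ix)),idxs)
```

Here the smallest category has 4 labels and p sums to 3, so the request is valid. With two categories, the three sets should select 4, 2, and 4 labels.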

Name-Value Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'TableVariable',"AreaCode",'Exclude',["617" "508"] specifies that the function splits labels based on telephone area code and excludes numbers from Boston and Natick.

Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then splitlabels uses the first table variable.

Underlying datastore index, specified as an integer scalar. This argument applies when lblsrc is a CombinedDatastore object. splitlabels counts the labels in the datastore obtained using the UnderlyingDatastores property of lblsrc.

Output Arguments

Splitting indices, returned as a cell array.
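Each element of idxs can be used to index the label source and any data stored alongside it. The following sketch assumes a hypothetical feature matrix X with one row per label. The names X, XTrain, and XTest are illustrative and do not come from the examples on this page.

```matlab
% Illustrative data: 12 observations with 3 features each, plus labels.
X = randn(12,3);
lbls = [repmat("A",8,1); repmat("B",4,1)];

% Keep 75% of each category for training and the rest for testing.
idxs = splitlabels(lbls,0.75);

% Apply the same index sets to the data rows and to the labels.
XTrain = X(idxs{1},:);
lblTrain = lbls(idxs{1});
XTest = X(idxs{2},:);
lblTest = lbls(idxs{2});
```

With 8 "A" and 4 "B" labels, the training set should hold 9 observations (6 "A" and 3 "B") and the testing set the remaining 3.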

Introduced in R2021a