Case Studies

Cornell Bioacoustics Scientists Develop a High-Performance Computing Platform for Analyzing Big Data

Challenge

Detect and classify animal sounds in huge sets of acoustic data acquired from oceans, fields, forests, and jungles

Solution

Develop a high-performance computing platform for acoustic data analysis using MATLAB, Parallel Computing Toolbox, and MATLAB Parallel Server

Results

  • Years of development time saved
  • Analysis time reduced from weeks to hours
  • Previously unprocessed data analyzed in days

“High-performance computing with MATLAB enables us to process previously unanalyzed big data. We translate what we learn into an understanding of how human activities affect the health of ecosystems to inform responsible decisions about what humans do in the ocean and on land.”

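Dr. Christopher Clark, Cornell University
Acoustic analysis equipment used by the Bioacoustics Research Program to collect data on large whales and other marine mammals. Photo courtesy of Dimitri Ponirakis.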

For more than 30 years, scientists have studied local animal populations by recording animal sounds in oceans, jungles, forests, and other natural environments. They use the results to assess the effect of man-made noise on natural environments, monitor endangered animal populations, and investigate animal communication. Passive acoustic monitoring systems record sounds continuously, generating terabytes of data. Scientists are often unable to process even 1% of this data because they lack the necessary advanced algorithms and processing capacity.

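Scientists in the Bioacoustics Research Program (BRP) at the Cornell Lab of Ornithology analyze large volumes of acoustic data using MATLAB®, Parallel Computing Toolbox™, and MATLAB Parallel Server™. The project, funded by grants from the Office of Naval Research and the National Oceanographic Partnership Program, is led by two principal investigators at Cornell University: Dr. Christopher Clark, senior scientist and director of BRP, and Dr. Peter Dugan, lead data scientist for BRP.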

“MATLAB and MATLAB parallel computing tools gave us the flexibility to dynamically improve and adapt the algorithms that we use to process our big acoustic data sets,” says Dr. Clark. “If we were using C++ or a similar language, we would not be able to move as quickly or explore as many scenarios.”

Challenge

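Researchers analyzing acoustic data must contend with noise from weather, other animals, and nearby machinery and vehicles. Variability in the sounds made by individual animals within a species is another complication. These two factors, noise and variability, increase the number of false positives and false negatives, reducing the accuracy of detection algorithms.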

Processing the hundreds of terabytes of data that BRP is gathering presents another challenge. A typical project involves processing years of raw acoustic data (up to 10 TB) recorded on multiple channels. Each channel may capture hundreds of millions of events, that is, sounds that stand out when the data is viewed as a spectrogram. Algorithms tested on small, high-quality samples are often considerably less accurate when applied to larger, noisier data sets.

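Finally, BRP analysis tools must serve a wide range of research programs, environments, and shifting requirements. “The answers to our initial research questions often lead to entirely new avenues to explore, and we need to be able to handle sudden changes in our requirements,” says Dr. Clark.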

Solution

BRP data scientists used MATLAB to develop high-performance computing (HPC) software for automatically processing acoustic data.

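They begin a detection-classification project by gathering a catalog of audio clips of the animal they hope to detect, background noise from the animal's environment, and archived acoustic data. Working in MATLAB, they develop new algorithms, or refine existing ones, that detect audio sequences in the archived data that resemble those in the clip catalog.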

The algorithms use pattern matching, edge detection, connected region analysis, convolution, and other techniques supported by Image Processing Toolbox™ and Signal Processing Toolbox™, as well as machine learning techniques supported by Fuzzy Logic Toolbox™ and Deep Learning Toolbox™.
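As a rough illustration of the connected-region step mentioned above, the sketch below thresholds a tiny hand-made spectrogram grid and groups adjacent above-threshold cells into candidate events. It is written in Python rather than MATLAB, and the grid values, the threshold, and the names `SPECTROGRAM` and `find_events` are assumptions made up for this example. It is not BRP's code or data.

```python
from collections import deque

# Hypothetical spectrogram: rows are frequency bins, columns are time frames,
# values are energy in arbitrary units (made up for this example).
SPECTROGRAM = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 6, 0, 0, 0],
    [0, 0, 7, 0, 0, 4],
    [0, 0, 0, 0, 0, 5],
]


def find_events(spec, threshold):
    """Return one bounding box per 4-connected region at or above threshold.

    Each box is (first_row, first_col, last_row, last_col).
    """
    rows, cols = len(spec), len(spec[0])
    seen = [[False] * cols for _ in range(rows)]
    events = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or spec[r][c] < threshold:
                continue
            # Breadth-first flood fill from this unvisited bright cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and spec[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            events.append((r0, c0, r1, c1))
    return events


EVENTS = find_events(SPECTROGRAM, threshold=4)
```

Each bounding box marks a time-frequency patch that a later classification stage could compare against the clip catalog. A real detector would work on far larger arrays and would typically smooth or denoise the spectrogram first.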

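To evaluate the accuracy of the algorithms, the researchers use Statistics and Machine Learning Toolbox™ to compute receiver operating characteristic (ROC) and other performance curves.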
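To make the idea of an ROC curve concrete, here is a minimal Python sketch that sweeps a decision threshold over detector scores and records the false positive rate and true positive rate at each step. The scores, labels, and function names are invented for illustration, and the code assumes at least one positive and one negative label. It does not reproduce the toolbox function the researchers used.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs.

    scores: detector confidence per candidate event (higher = more likely a call).
    labels: 1 if the event is a true call, 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep the decision threshold from the highest score downward.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores and ground-truth labels for six candidate events.
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
LABELS = [1, 1, 0, 1, 0, 0]

ROC = roc_points(SCORES, LABELS)
AUC = area_under_curve(ROC)
```

An area close to 1 means the detector ranks true calls above noise almost everywhere, while an area near 0.5 is no better than chance. Comparing curves for several algorithms on the same labeled data is one way to choose among them.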

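After debugging and optimizing the algorithms on small data sets with Parallel Computing Toolbox, the scientists run them on the complete archived data set on a 64-worker cluster using MATLAB Parallel Server.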

The BRP team developed a MATLAB interface that enables researchers to specify the algorithm, the data set, and the number of processors.
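The workflow described here (choose an algorithm, a data set, and a worker count, then fan the work out) can be sketched in a few lines. The Python example below is only an analogy: a thread pool stands in for the cluster workers, and `run_job`, `ALGORITHMS`, `energy_detector`, and the sample data are hypothetical names and values, not part of BRP's MATLAB interface.

```python
from concurrent.futures import ThreadPoolExecutor


def energy_detector(chunk):
    """Toy detector: count samples whose magnitude exceeds a fixed level."""
    return sum(1 for sample in chunk if abs(sample) > 0.5)


# Hypothetical registry so a job can name its algorithm as a string.
ALGORITHMS = {"energy": energy_detector}


def run_job(algorithm, dataset, num_workers):
    """Apply the named detector to every chunk, one chunk per task.

    Results come back in the same order as the input chunks.
    """
    detector = ALGORITHMS[algorithm]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(detector, dataset))


# Made-up data set: three short chunks of audio samples.
DATASET = [
    [0.1, 0.9, -0.7],
    [0.0, 0.2, 0.3],
    [0.6, -0.6, 0.8, 0.1],
]

COUNTS = run_job("energy", DATASET, num_workers=2)
```

The pattern works because each chunk can be analyzed independently, so the result does not depend on how many workers are used. For CPU-bound analysis at scale, a process pool or a cluster scheduler would take the place of the thread pool.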

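In collaboration with MarineXplore and the Kaggle community, BRP sponsored a worldwide competition in which more than 240 participants submitted algorithms to detect and classify the upsweep contact calls of North Atlantic right whales. BRP used its MATLAB HPC platform to identify the most accurate algorithms, which will be used to help prevent collisions with whales.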

In addition to detection and classification algorithms, BRP uses MATLAB for noise analysis and acoustic modeling, in which the time and frequency dispersion effects of marine or terrestrial environments are captured and simulated.

Results

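  • Years of development time saved. “A study of projected costs showed that it would have taken three years, $1 million, and a lot of outside help to develop the HPC platform we needed if we had to do it ourselves,” says Dr. Dugan. “Using Parallel Computing Toolbox and MATLAB Parallel Server, we developed the platform in three months.”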

  • Analysis time reduced from weeks to hours. “It took one of our algorithms 19 weeks to process 90 days of data,” says Dr. Dugan. “Using Parallel Computing Toolbox and MATLAB Parallel Server, we completed the same analysis on our cluster in 8 hours.”

  • Previously unprocessed data analyzed in days. “One data set captured 100,000 hours of sound. It was so large that we had previously processed less than 1% of it, estimating that it would take a year or more to process the rest,” says Dr. Dugan. “With our MATLAB HPC platform, we processed the data six times, using different detection algorithms, in two days.”
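The figures quoted in these results can be turned into rough ratios with a little arithmetic. The Python sketch below assumes that "19 weeks" means continuous, around-the-clock processing and that "two days" means 48 wall-clock hours. Both are assumptions of this example, since the quotes do not say.

```python
HOURS_PER_WEEK = 7 * 24

# Figures quoted in the results above.
serial_hours = 19 * HOURS_PER_WEEK   # 19 weeks for one algorithm, before
cluster_hours = 8                    # the same analysis on the cluster
speedup = serial_hours / cluster_hours

audio_hours = 100_000                # size of the large data set
passes = 6                           # processed six times
wall_clock_hours = 2 * 24            # in two days
realtime_factor = audio_hours * passes / wall_clock_hours
```

Under those assumptions, the first result corresponds to a speedup of roughly 400 times, and the second to about 12,500 hours of audio analyzed per wall-clock hour.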

Cornell University is among the 1300 universities worldwide that provide campus-wide access to MATLAB and Simulink. With the Campus-Wide License, researchers, faculty, and students have access to a common configuration of products, at the latest release level, for use anywhere—in the classroom, at home, in the lab or in the field.