Analyzing Uber Ride Sharing GPS Data

Posted by Loren Shure, September 6, 2014


Many of us carry smartphones that can track our GPS positions, and that is an interesting source of data. How can we analyze GPS data in MATLAB?

Today's guest blogger, Toshi Takeuchi, would like to share an analysis of a public GPS dataset from a popular ride sharing service, Uber.

Introduction

Uber is a ride sharing service that connects passengers with private drivers through a mobile app and takes care of payment. It is in fact so popular that you hear about it in the news because of its conflicts with local traffic regulations and taxi business interests.

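Uber's ride sharing GPS data was publicly available on Infochimps.com, so I used it for this analysis (unfortunately, it is no longer available). What can we learn from this dataset?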

Uber Anonymized GPS Logs

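First, let's download the dataset from the link above (a zipped TSV file). It contains GPS logs taken from the mobile apps in Uber cars that were actively transporting passengers in San Francisco. The data have been anonymized by removing names and trip start and end points. The dates were also substituted, but weekdays and time of day are still intact.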

For the purpose of this analysis, let's focus on the data captured within the city and visualize it with Mapping Toolbox.

Run the script to load the data. Check loadData.m to see the details.

loadData
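loadData.m is not reproduced here. As a rough idea of how weekday and hour-of-day columns could be derived from timestamps, here is a minimal sketch on synthetic data. The timestamps, the variable names `names` and `dayNum`, and the conversion from UTC to San Francisco local time are all assumptions for illustration; the post does not say how the original script handles time zones.

```matlab
% Three made-up timestamps, assumed to be recorded in UTC.
t = datetime(2007, 1, [5; 6; 7], [8; 7; 23], [15; 30; 59], 0, 'TimeZone', 'UTC');

% Assumption: shift to San Francisco local time before extracting the fields.
t.TimeZone = 'America/Los_Angeles';

% weekday() returns 1 for Sunday through 7 for Saturday.
names  = {'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'};
dayNum = weekday(t);

% Categorical day name with categories ordered Monday to Sunday,
% and the hour of day as a number from 0 to 23.
DayName   = categorical(reshape(names(dayNum), [], 1), ...
    {'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'});
TimeOfDay = hour(t);

disp(table(t, DayName, TimeOfDay))
```

Fixing the category order up front keeps later grouped summaries and plots in Monday-to-Sunday order rather than alphabetical order.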

Overlay the GPS points on the map.

% State outlines and map limits taken from the data
states = geoshape(shaperead('usastatehi', 'UseGeoCoords', true));
latlim = [min(T.Lat) max(T.Lat)];
lonlim = [min(T.Lon) max(T.Lon)];
ocean = [0.7 0.8 1];
land = [0.9 0.9 0.8];

% Draw the base map, then the GPS points
figure
ax = usamap(latlim, lonlim);
setm(ax, 'FFaceColor', ocean)
geoshow(states, 'FaceColor', land)
geoshow(T.Lat, T.Lon, 'DisplayType', 'Point', 'Marker', '.', ...
    'MarkerSize', 4, 'MarkerEdgeColor', [0 0 1])
title('Uber GPS Log Data')
xlabel('San Francisco')

% Label a few landmarks
textm(37.802069, -122.446618, 'Marina')
textm(37.808376, -122.426105, 'Fishermans Wharf')
textm(37.797322, -122.482409, 'Presidio')
textm(37.774546, -122.412329, 'SOMA')
textm(37.770731, -122.440481, 'Haight')
textm(37.818276, -122.498546, 'Golden Gate Bridge')
textm(37.819632, -122.376065, 'Bay Bridge')

Does the usage change over time?

Let's start with a basic question: how does the use of the Uber service change over time? We can use grpstats to summarize data grouped by specific categorical values, such as DayName and TimeOfDay, which were added in the data loading process.

Get grouped summaries.

byDay = grpstats(T(:, {'Lat', 'Lon', 'DayName'}), 'DayName');
byDayTime = grpstats(T(:, {'Lat', 'Lon', 'TimeOfDay', 'DayName'}), ...
    {'DayName', 'TimeOfDay'});

Reshape the count of entries into a 24x7 matrix.

byDayTimeCount = reshape(byDayTime.GroupCount,24,7)';
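The reshape above relies on every one of the 7 x 24 day/hour combinations being present in the data, with the hour varying fastest within each day. A minimal sketch that does not depend on this, using base MATLAB's accumarray on synthetic data, might look like the following. The variables `dayIdx` and `hourOfDay` are hypothetical stand-ins, not columns of the original table.

```matlab
% Synthetic stand-in for the GPS log: day index (1 = Mon ... 7 = Sun)
% and hour of day (0-23) for a handful of records.
dayIdx    = [1; 1; 5; 6; 6; 6; 7];
hourOfDay = [8; 8; 23; 0; 1; 1; 2];

% Count records per (day, hour) cell. Combinations that never occur
% stay at zero instead of shifting the layout of the matrix.
counts = accumarray([dayIdx, hourOfDay + 1], 1, [7 24]);

disp(size(counts))      % 7 rows (days) by 24 columns (hours)
disp(counts(1, 9))      % number of records on Monday between 08:00 and 08:59
disp(sum(counts, 2)')   % total records per day
```

With the full dataset both approaches should agree as long as no day/hour combination is missing; the accumarray version simply makes the 7 x 24 layout explicit.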

Plot the data by day of week and by hour for each day of the week.

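figure
% Top panel: number of log entries per day of week
subplot(2,1,1)
bar(byDay.GroupCount);
set(gca, 'XTick', 1:7, 'XTickLabel', cellstr(byDay.DayName));
% Bottom panel: hourly counts, one line per day of week
subplot(2,1,2)
plot(byDayTimeCount');
set(gca, 'XTick', 1:24);
xlabel('Hours by Day of Week');
legend('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', ...
    'Orientation', 'Horizontal', 'Location', 'SouthOutside')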

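It looks like the usage goes up during the weekend (Friday through Sunday) and peaks in the early hours of the day. San Francisco has a very active night life!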

Where do they go during the weekend?

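Is there a way to figure out where people go during the weekend? Even though the dataset doesn't contain the actual starting and ending points of individual trips, we may still get a sense of how the traffic flows by looking at the first and last points of each record.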

We can extract the starting and ending location data for weekend rides. Click getStartEndPoints.m to see how it is done. If you would like to run this script, please download districts.xlsx as well.
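getStartEndPoints.m is not shown here. As a hedged illustration of the idea of taking the first and last logged point of each record, here is a minimal sketch on synthetic data. The table `logDemo` and its `ID` column are hypothetical, and the rows are assumed to be in time order already; the original script may work differently.

```matlab
% Synthetic log: three hypothetical records, rows already in time order.
ID  = [1; 1; 1; 2; 2; 3];
Lat = [37.80; 37.79; 37.78; 37.77; 37.76; 37.75];
Lon = [-122.44; -122.43; -122.42; -122.41; -122.40; -122.45];
logDemo = table(ID, Lat, Lon);

% Group rows by record ID, then take the first and last position per group.
[g, recordID] = findgroups(logDemo.ID);
StartLat = splitapply(@(x) x(1),   logDemo.Lat, g);
StartLon = splitapply(@(x) x(1),   logDemo.Lon, g);
EndLat   = splitapply(@(x) x(end), logDemo.Lat, g);
EndLon   = splitapply(@(x) x(end), logDemo.Lon, g);

startEndDemo = table(recordID, StartLat, StartLon, EndLat, EndLon);
disp(startEndDemo)
```

A record with a single logged point, like ID 3 here, ends up with identical start and end positions, so such records may need to be filtered out before studying flows.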

% Here we load the preprocessed data |startEnd.mat| to save time and plot
% their starting points.
% getStartEndPoints % commented out to save time
load startEnd.mat % load the preprocessed data

figure
ax = usamap(latlim, lonlim);
setm(ax, 'FFaceColor', ocean)
geoshow(states, 'FaceColor', land)
geoshow(startEnd.StartLat, startEnd.StartLon, 'DisplayType', 'Point', ...
    'Marker', '.', 'MarkerSize', 5, 'MarkerEdgeColor', [0 0 1])
title('Uber Weekend Rides - Starting Points')
xlabel('San Francisco')
textm(37.802069, -122.446618, 'Marina')
textm(37.808376, -122.426105, 'Fishermans Wharf')
textm(37.797322, -122.482409, 'Presidio')
textm(37.774546, -122.412329, 'SOMA')
textm(37.770731, -122.440481, 'Haight')
textm(37.818276, -122.498546, 'Golden Gate Bridge')
textm(37.819632, -122.376065, 'Bay Bridge')

When you plot the longitude and latitude data, you just get messy point clouds and it is hard to see what's going on. Instead, I broke the map of San Francisco into rectangular blocks to approximate its districts. Here is the new plot of starting points by district.

% One color per district
dist = categories(startEnd.StartDist);
cc = hsv(length(dist));

figure
ax = usamap(latlim, lonlim);
setm(ax, 'FFaceColor', ocean)
geoshow(states, 'FaceColor', land)
% Plot the starting points of each district in its own color
for i = 1:length(dist)
    inDist = startEnd.StartDist == dist(i);
    geoshow(startEnd.StartLat(inDist), startEnd.StartLon(inDist), ...
        'DisplayType', 'Point', 'Marker', '.', 'MarkerSize', 5, ...
        'MarkerEdgeColor', cc(i,:))
end
title('Uber Weekend Rides - Starting Points by District')
xlabel('San Francisco')
textm(37.802069, -122.446618, 'Marina')
textm(37.808376, -122.426105, 'Fishermans Wharf')
textm(37.797322, -122.482409, 'Presidio')
textm(37.774546, -122.412329, 'SOMA')
textm(37.770731, -122.440481, 'Haight')
textm(37.818276, -122.498546, 'Golden Gate Bridge')
textm(37.819632, -122.376065, 'Bay Bridge')
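The rectangular-block approximation of districts can be sketched as a simple bounding-box test. The boxes below are made-up values for illustration only; the actual boundaries are in districts.xlsx, whose layout is not shown in this post, and the variable names are hypothetical.

```matlab
% Hypothetical bounding boxes, one row per district:
% [latMin latMax lonMin lonMax]. Not the values from districts.xlsx.
names = {'Marina'; 'SOMA'};
boxes = [37.795 37.810 -122.450 -122.425;
         37.765 37.790 -122.420 -122.390];

% A few sample points to classify.
lat = [37.802; 37.775; 37.700];
lon = [-122.446; -122.412; -122.500];

% Start with everything unassigned, then label points box by box.
district = repmat({'Other'}, numel(lat), 1);
for k = 1:size(boxes, 1)
    inBox = lat >= boxes(k,1) & lat <= boxes(k,2) & ...
            lon >= boxes(k,3) & lon <= boxes(k,4);
    district(inBox) = names(k);
end
district = categorical(district);
disp(district)
```

If boxes overlap, a point takes the label of the last matching box in this sketch, so the order of the rows matters.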

Visualizing the traffic patterns with Gephi

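This is a step in the right direction. Now that we have the starting and ending points grouped by district, we can represent the rides as connections among different districts. This is essentially a graph with districts as nodes and rides as edges. To visualize this graph, we can use Gephi, a popular social network analysis tool, which I also used in another post, Analyzing Twitter with MATLAB.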

You can export StartDist and EndDist as the edge list to Gephi in CSV format.

writetable(startEnd(:, {'StartDist', 'EndDist'}), 'edgelist.csv', ...
    'WriteVariableNames', false)

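Once you export the edge list, you can plot the connections (edges) among districts (nodes) in Gephi. Now it is much easier to see where people went during the weekend! To see a larger image, check out the PDF version.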

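The size of each district node represents its in-degree, the number of incoming connections, which you can think of as a measure of popularity as a destination. SOMA, Haight, Mission District, Downtown, and The Castro are the popular locations based on this measure.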
The districts are colored based on their modularity, which basically means which cluster of nodes they belong to. It looks like people hang around a set of districts that are near each other: SOMA, Downtown, and Mission District are all located towards the south (green). The Castro, Haight, and Western Addition form a cluster in the center (purple), which is strongly connected to Richmond and Sunset District in the west. Since those are residential areas, it seems people from those areas hang out in the other districts in the same cluster.
The locals don't seem to go to Fisherman's Wharf or Chinatown in the north (red) very much - they are probably considered not cool because of tourists?
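Gephi computes the in-degrees itself, but the same measure can be cross-checked in MATLAB by counting how often each district appears as a destination in the edge list. This is a minimal sketch on a made-up edge list; the cell arrays `s` and `t` are hypothetical stand-ins for the StartDist and EndDist columns.

```matlab
% Hypothetical edge list: one row per ride, start district -> end district.
s = {'Marina'; 'Haight'; 'Marina'; 'SOMA';   'Haight'};
t = {'SOMA';   'SOMA';   'Haight'; 'Marina'; 'SOMA'};

% Use the union of both columns as the node set, so that districts that
% only appear as origins still get an in-degree of zero.
districts = unique([s; t]);

% In-degree: rides ending in each district. Out-degree: rides starting there.
inDeg  = countcats(categorical(t, districts));
outDeg = countcats(categorical(s, districts));

result = table(districts, inDeg, outDeg, ...
    'VariableNames', {'District', 'InDegree', 'OutDegree'});
result = sortrows(result, 'InDegree', 'descend');
disp(result)
```

Here repeated rides between the same pair of districts each count once, which matches treating every row of the edge list as a separate edge.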