
Farewell, Network Engineer: My Road to Data Analysis (Part 7), Data Visualization Analysis (Bike-Sharing Case Study)

Resource Sharing  2018-06-18  itlogger

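[Bike-Sharing Data Visualization Analysis]

1. Project Overview

  Bike sharing is a new kind of time-based vehicle rental service built around bicycles. The whole rental process (obtaining membership, renting, and returning the bike) is handled automatically through a citywide network of kiosk stations. It makes fuller use of public roads and also gives riders some exercise.

  The data generated by bike-sharing systems is attractive to researchers because the trip duration, departure location, arrival location, and elapsed time are all explicitly recorded. A bike-sharing system therefore acts as a sensor network that can be used to study mobility in a city.

  Project goal: analyze the relationship between historical bike-sharing usage patterns and weather data, to provide data support for the next step, forecasting bike rental demand. The data comes from the Kaggle Bike Sharing Demand competition.

2. Understanding the Data

2.1 Data fields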

datetime - hourly date + timestamp, e.g. {1/1/2011  10:00:00 AM}
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - 1 = holiday, 0 = not a holiday
workingday - 1 = working day, 0 = holiday/weekend
weather -
    1: Clear, Few clouds, Partly cloudy
    2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals (dependent variable) = casual + registered

2.2 Data information

import pandas as pd
import numpy as np
import seaborn as sns
import pylab
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
dailyData = pd.read_csv("I:/3-10-Python/bike/train.csv")
testData = pd.read_csv("I:/3-10-Python/bike/test.csv")
dailyData.rename(columns={"count": "rentCount"},inplace = True)
dailyData.shape
# 10886 rows, 12 columns, no missing data
(10886, 12)
dailyData.head()
        datetime  season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  rentCount
0  1/1/2011 0:00       1        0           0        1  9.84  14.395        81        0.0       3          13         16
1  1/1/2011 1:00       1        0           0        1  9.02  13.635        80        0.0       8          32         40
2  1/1/2011 2:00       1        0           0        1  9.02  13.635        80        0.0       5          27         32
3  1/1/2011 3:00       1        0           0        1  9.84  14.395        75        0.0       3          10         13
4  1/1/2011 4:00       1        0           0        1  9.84  14.395        75        0.0       0           1          1
dailyData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime      10886 non-null object
season        10886 non-null int64
holiday       10886 non-null int64
workingday    10886 non-null int64
weather       10886 non-null int64
temp          10886 non-null float64
atemp         10886 non-null float64
humidity      10886 non-null int64
windspeed     10886 non-null float64
casual        10886 non-null int64
registered    10886 non-null int64
rentCount     10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
dailyData.describe()
# For each numeric column, draw its distribution (histogram + KDE) next to a box plot
fig=plt.figure(figsize=(16,16),dpi=80)
numericColumns=['rentCount','temp','atemp','humidity','windspeed']
for i,column in enumerate(numericColumns):
    axDist=plt.subplot(5,4,2*i+1)
    axBox=plt.subplot(5,4,2*i+2)
    # sns.distplot is deprecated since seaborn 0.11; histplot with kde=True replaces it
    sns.histplot(dailyData[column],kde=True,stat='density',ax=axDist)
    sns.boxplot(x=dailyData[column],ax=axBox)

(1) Mean / median of each indicator

Temperature: 20.23086 / 20.50000

"Feels like" temperature: 23.655084 / 24.240000

Humidity: 61.886460 / 62.000000

Wind speed: 12.799395 / 12.998000

Rental count: 191.574132 / 145.000000
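
The figures above can be read off `describe()`, or computed directly. The following is a minimal sketch, assuming a DataFrame that uses the same column names as `dailyData`; the `summarize` helper and the toy values are made up for illustration and are not the real dataset.

```python
import pandas as pd

def summarize(df, columns):
    """Return one row per column with its mean and median (hypothetical helper)."""
    return df[columns].agg(["mean", "median"]).T

# Toy data standing in for dailyData; the values are invented.
toy = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02, 9.84],
    "rentCount": [16, 40, 32, 13],
})

summary = summarize(toy, ["temp", "rentCount"])
print(summary)
```

On the real data the call would presumably be `summarize(dailyData, ['temp', 'atemp', 'humidity', 'windspeed', 'rentCount'])`.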

3. Feature Engineering

3.1 Extracting year, month, day, hour, and weekday from the datetime field, e.g. {1/1/2011 10:00:00 AM}


from datetime import datetime
from datetime import date
import calendar
import time
#date = dailyData.datetime.split('/')[0]
#date = lambda x : x.split(' ')[0]
dailyData['date'] = dailyData.datetime.apply(lambda x : x.split(' ')[0])
dailyData['month'] = dailyData.datetime.apply(lambda x : x.split('/')[0]).astype(int)
dailyData['day'] = dailyData.datetime.apply(lambda x : x.split('/')[1]).astype(int)
dailyData['year'] = dailyData.datetime.apply(lambda x : x.split('/')[2].split(' ')[0]).astype(int)
dailyData['hours'] = dailyData.datetime.apply(lambda x : x.split(' ')[1].split(':')[0]).astype('int')
dailyData['weekday']=dailyData.datetime.apply(lambda dateString:calendar.day_name[datetime.strptime(dateString,"%m/%d/%Y %H:%M").weekday()])
weekdayDict={'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6,'Sunday':7}
dailyData['weekday']=dailyData['weekday'].map(weekdayDict)
dailyData.loc[:,['datetime','date','month','year','hours','weekday','rentCount']].head()
        datetime      date  month  year  hours  weekday  rentCount
0  1/1/2011 0:00  1/1/2011      1  2011      0        6         16
1  1/1/2011 1:00  1/1/2011      1  2011      1        6         40
2  1/1/2011 2:00  1/1/2011      1  2011      2        6         32
3  1/1/2011 3:00  1/1/2011      1  2011      3        6         13
4  1/1/2011 4:00  1/1/2011      1  2011      4        6          1
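
The string splitting above can also be done with pandas' own datetime parsing. This is only a sketch of an alternative, not the approach used in this post: `add_time_features` is a hypothetical helper, and the format string assumes timestamps written as `month/day/year hour:minute`, as in the rows shown above.

```python
import pandas as pd

def add_time_features(df, column="datetime", fmt="%m/%d/%Y %H:%M"):
    """Return a copy of df with month, day, year, hours and weekday columns."""
    out = df.copy()
    ts = pd.to_datetime(out[column], format=fmt)
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["year"] = ts.dt.year
    out["hours"] = ts.dt.hour
    # Series.dt.weekday is 0 for Monday; shift so Monday=1 ... Sunday=7,
    # matching the weekdayDict mapping used in this post.
    out["weekday"] = ts.dt.weekday + 1
    return out

# Toy rows in the same format as the datetime column (values invented).
toy = pd.DataFrame({"datetime": ["1/1/2011 0:00", "1/3/2011 17:00"]})
features = add_time_features(toy)
print(features)
```

Parsing once and reading the parts from the `.dt` accessor avoids repeating a separate `apply` for each field.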

3.2 One-hot encoding of season

dailyData['season'] = dailyData.season.map({1:'Spring',2:'Summer',3:'Fall',4:'Winter'})
seasonDf = pd.get_dummies(dailyData['season'],prefix='Season')

3.3 One-hot encoding of weather

dailyData['weather'] = dailyData.weather.map({1:'Clear',2:'Mist',3:'Light',4:'Heavy'})
weatherDf = pd.get_dummies(dailyData['weather'],prefix='weather')
dailyData=pd.concat([dailyData,seasonDf],axis=1)
dailyData=pd.concat([dailyData,weatherDf],axis=1)
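
To show what the encoding produces, here is a small self-contained sketch on invented values. It assumes the same integer season codes as the dataset; `dtype=int` is an optional choice made here so that the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Toy season codes (invented), mapped to labels the same way as above.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1]})
labels = toy["season"].map({1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"})

# One indicator column per label; exactly one of them is 1 in each row.
dummies = pd.get_dummies(labels, prefix="Season", dtype=int)
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

The indicator columns come out in alphabetical order of the labels (Season_Fall, Season_Spring, Season_Summer, Season_Winter).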

Features selected as most strongly associated with count:
registered, casual, temp, atemp, year, month, Season_Fall, Season_Spring, weather_Clear, windspeed, weather_Light, humidity

4. Data Visualization Analysis

4.1 Effect of time on rental volume

yearmonthDf = dailyData.groupby(['year','month']).agg({'rentCount':'sum'}).reset_index()
yearmonthDf['y']=yearmonthDf['year'].astype(str)+"/"+yearmonthDf.month.astype(str)
months=mdates.MonthLocator()
days=mdates.DayLocator()
timeFmt=mdates.DateFormatter('%Y-%m')
xs = [datetime.strptime(d, '%Y/%m').date() for d in yearmonthDf['y']]
fig=plt.figure(figsize=(24,8),dpi=80)
ax=plt.subplot(2,1,1)
ax.xaxis.set_major_formatter(timeFmt)
plt.plot(xs,yearmonthDf['rentCount'],marker='o')
plt.gcf().autofmt_xdate()
#plt.scatter(dailyData.datetime,dailyData['count'])
ax2=plt.subplot(2,1,2)
sns.swarmplot(x="month",y="rentCount",data=dailyData,ax=ax2)
plt.show()

import matplotlib.gridspec as gridspec
gs = gridspec.GridSpec(3, 2)

monthDf = dailyData.groupby(['month']).agg({'rentCount':'mean'}).reset_index()
seasonDf = dailyData.groupby(['season']).agg({'rentCount':'mean'}).reset_index()

#monthDf
fig=plt.figure(figsize=(16,12),dpi=80)
plt.subplots_adjust(hspace=0.5)
ax1=plt.subplot(gs[0,0])
ax1.set_title("Month-Mean",loc="center")
ax1.set_xlabel("Month")
ax1.set_xticks(range(1,13,1))
ax1.set_ylabel("Mean")
ax2=plt.subplot(gs[0,1])
ax2.set_title("Season-Mean")
ax2.set_xlabel("Season")
ax2.set_ylabel("Mean")

ax1.plot(monthDf.month,monthDf['rentCount'],marker='o')
ax2.plot(seasonDf.season,seasonDf['rentCount'],marker='o')

hourDf = dailyData.groupby(['hours']).agg({'rentCount':'mean'}).reset_index()
#hourDf.sort_values(by='hours')
ax3=plt.subplot(gs[1,0])
ax3.set_title("Hour-Mean")
ax3.set_xticks(hourDf.hours)
ax3.set_xlabel("Hour")
ax3.set_ylabel("Mean")
ax3.plot(hourDf.hours,hourDf['rentCount'],marker='o')
ax4=plt.subplot(gs[1,1])
#ax4.set_xticklabels(['Spring','Summer','Fall','Winter'])
seasonDf1=dailyData.groupby('season').agg({'rentCount':'sum'})
seasonDf1.plot(x="season",y="rentCount",kind='pie',autopct='%.3f%%',ax=ax4)
ax5=plt.subplot(gs[2,0])
sns.violinplot(x="holiday", y="rentCount", data=dailyData, split=True,ax=ax5);
ax6=plt.subplot(gs[2,1])
sns.violinplot(x="workingday", y="rentCount", data=dailyData, split=True,ax=ax6);

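Time vs. rental volume

Rentals increase year over year. Spring, January in particular, is the low season; volume then grows steadily and peaks in summer and fall, and it starts to decline in November and December, in the winter season.

8:00 and 17:00-18:00 are the peak hours, and volume starts to fall after 21:00.

Whether a day is a holiday or a working day has little effect on rental volume.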
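
The violin plots compare the overall distribution of rentals by holiday and working day. One way to probe that conclusion further is to break the hourly mean down by `workingday`. The sketch below assumes the `hours`, `workingday`, and `rentCount` columns built earlier; `hourly_profile` is a hypothetical helper, and the toy numbers are invented and do not represent a result from the real data.

```python
import pandas as pd

def hourly_profile(df):
    """Mean rentCount per hour, with one column per workingday value."""
    return df.pivot_table(index="hours", columns="workingday",
                          values="rentCount", aggfunc="mean")

# Toy data standing in for dailyData (values invented).
toy = pd.DataFrame({
    "hours":      [8, 8, 8, 8, 14, 14],
    "workingday": [1, 1, 0, 0, 1, 0],
    "rentCount":  [300, 500, 50, 150, 120, 280],
})

profile = hourly_profile(toy)
print(profile)
```

Calling `profile.plot(marker='o')` would draw one line per `workingday` value, so the two daily patterns can be compared hour by hour.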

4.2 Effect of the environment on rental volume

fig=plt.figure(figsize=(16,8),dpi=80)
plt.subplots_adjust(hspace=0.6)
ax1=plt.subplot(2,2,1)
ax2=plt.subplot(2,2,2)
ax3=plt.subplot(2,2,3)
ax4=plt.subplot(2,2,4)

tempDf = pd.DataFrame({'temp': dailyData['temp'], 
                     'count': dailyData['rentCount']})
atempDf = pd.DataFrame({'atemp': dailyData['atemp'], 
                     'count': dailyData['rentCount']})
humilityDf = pd.DataFrame({'humidity': dailyData['humidity'], 
                     'count': dailyData['rentCount']})
windspeedDf = pd.DataFrame({'windspeed': dailyData['windspeed'], 
                     'count': dailyData['rentCount']})
# Split each variable into 10 equal-width bins and count the hourly records in each bin
tempDf.sort_values(by=['temp'],inplace=True)
tempCountQuar = pd.cut(tempDf.temp, 10)
tempCount=tempCountQuar.value_counts().sort_index()

atempCountQuar = pd.cut(atempDf.atemp, 10)
atempCount=atempCountQuar.value_counts().sort_index()

humilityCount=pd.cut(humilityDf.humidity,10).value_counts().sort_index()
windspeedCount=pd.cut(windspeedDf.windspeed,10).value_counts().sort_index()
tempCount.plot(kind='bar',ax=ax1)
ax1.set_xlabel('temp')
ax1.set_ylabel('count')
ax1.set_title('Chart-1  temp-count')
ax1.set_xticklabels(tempCount.index,rotation=40)
atempCount.plot(kind='bar',ax=ax2)

ax2.set_xlabel('atemp')
ax2.set_ylabel('count')
ax2.set_title('Chart-2  atemp-count')
ax2.set_xticklabels(atempCount.index,rotation=40)
humilityCount.plot(kind='bar',ax=ax3)
ax3.set_xticklabels(humilityCount.index,rotation=40)
ax3.set_xlabel('humidity')
ax3.set_ylabel('count')
ax3.set_title('Chart-3  humidity-count')
windspeedCount.plot(kind='bar',ax=ax4)
ax4.set_xlabel('windspeed')
ax4.set_ylabel('count')
ax4.set_title('Chart-4  windspeed-count')
plt.xticks(rotation=45)
#ax4.set_xticklabels(windspeedCount.index,rotation=40)


sns.violinplot(x="weather", y="rentCount", data=dailyData);

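Environment vs. rental volume

(1) Rental volume drops at low and at high temperatures.

(2) Rental volume drops when humidity is low or fairly high.

(3) Rental volume drops when the wind is very strong or very weak.

(4) As for weather conditions, Heavy weather is very rare in the data, and average rentals under Light conditions are clearly lower.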
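
The bar charts in this section are built from `value_counts()`, so each bar is the number of hourly records that fall in a bin, not the rentals in that bin. To look at rental volume itself, one option is to average `rentCount` within each bin. This is a sketch under that assumption; `mean_rentals_by_bin` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd

def mean_rentals_by_bin(df, column, bins=10):
    """Mean rentCount within equal-width bins of the given column."""
    binned = pd.cut(df[column], bins)
    # observed=False keeps empty bins in the result (their mean is NaN)
    return df.groupby(binned, observed=False)["rentCount"].mean()

# Toy data standing in for dailyData (values invented).
toy = pd.DataFrame({
    "temp":      [0.0, 1.0, 9.0, 10.0],
    "rentCount": [10, 30, 100, 200],
})

by_temp = mean_rentals_by_bin(toy, "temp", bins=2)
print(by_temp)
```

On the real data, `mean_rentals_by_bin(dailyData, 'temp').plot(kind='bar')` would give a chart whose bar heights are average rentals per temperature range.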

corrDailyDf=pd.DataFrame({'month': dailyData['month'],'hours': dailyData['hours'],'temp': dailyData['temp'], 'atemp': dailyData['atemp'], 'humidity': dailyData.humidity,'windspeed': dailyData['windspeed'],
                     'weather_Light': dailyData['weather_Light'],'rentCount': dailyData['rentCount']})
corrDf=corrDailyDf.corr()
corrDf
                   atemp      hours   humidity      month  rentCount       temp  weather_Light  windspeed
atemp           1.000000   0.140343  -0.043536   0.264173   0.389784   0.984948      -0.031154  -0.057473
hours           0.140343   1.000000  -0.278011  -0.006818   0.400601   0.145430       0.014030   0.146631
humidity       -0.043536  -0.278011   1.000000   0.204537  -0.317371  -0.064949       0.295894  -0.318607
month           0.264173  -0.006818   0.204537   1.000000   0.166862   0.257589      -0.000392  -0.150192
rentCount       0.389784   0.400601  -0.317371   0.166862   1.000000   0.394454      -0.117519   0.101369
temp            0.984948   0.145430  -0.064949   0.257589   0.394454   1.000000      -0.025715  -0.017852
weather_Light  -0.031154   0.014030   0.295894  -0.000392  -0.117519  -0.025715       1.000000   0.045597
windspeed      -0.057473   0.146631  -0.318607  -0.150192   0.101369  -0.017852       0.045597   1.000000
corrDf['rentCount'].sort_values(ascending=False)
rentCount        1.000000
hours            0.400601
temp             0.394454
atemp            0.389784
month            0.166862
windspeed        0.101369
weather_Light   -0.117519
humidity        -0.317371
Name: rentCount, dtype: float64
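
Sorting by the signed coefficient puts the strongest negative correlation (about -0.317) at the bottom of the list, even though it is larger in magnitude than several of the positive ones. A sketch of ranking by absolute value instead follows; `rank_by_abs_corr` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd

def rank_by_abs_corr(df, target="rentCount"):
    """Pearson correlation with target, ordered by magnitude (signs kept)."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data standing in for corrDailyDf (values invented).
toy = pd.DataFrame({
    "temp":      [1, 2, 3, 4],
    "humidity":  [4, 3, 1, 2],
    "windspeed": [2, 1, 2, 1],
    "rentCount": [10, 20, 30, 40],
})

ranked = rank_by_abs_corr(toy)
print(ranked)
```

The signs are kept in the output, so the direction of each relationship stays visible while the ordering reflects strength.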
Please credit IT樵客 when reposting.
Article URL: http://www.itlogger.com/res/2494.html