The Boston area bike share system (BSS), originally “Hubway,” now “Bluebikes”, has been in existence since 2011, and there have been a number of interesting visualizations and data explorations about it. In this post I wanted to look at the effects of weather on the system utilization. I am aware of one other attempt to understand the effect of weather on ridership; however, here rather than focusing on predicting station utilization, I try to get a better understanding of the factors that affect overall system ridership. The present analysis finds that temperature, rain, and the amount of sky cover can strongly predict the number of daily rides of the Boston-area bike share system. Conversely, wind, relative humidity, or dew point temperature are not very significant. These findings are in line with what has been reported in the academic literature, though in this analysis I also focus on the differential effects of weather on weekend versus weekday ridership, where we see that atmospheric conditions have a greater effect on the number of rides on the weekends than on weekdays.
Introduction
The approach adopted here is to try to model the daily ridership of the Boston bike share system (our dependent variable) from a number of independent variables, or features. If a model can reasonably be fitted to the ridership data, we assume that we can learn something about the relative importance of the features in how they affect the ridership. The features considered for the model broadly fall into the following categories:
- Weather-related: precipitation, temperature, wind speed, etc.
- Calendrical: day of the week, holidays, and such.
- BSS-specific: the number of stations in the bike share system
- Other-transportation related: the number of customers entering the Boston subway system.
- Population data: size, age, etc.
I discuss the features and findings in the following sections. The data set of trips used in this analysis included trips from 2015 through the end of 2023; the other data sets, apart from the population data, covered the same dates.
Features of the Bike Share System
The number of stations (places to rent and leave a bike) in the Boston bike share system is not constant over time: there are seasonal effects, where some stations shut down over the winter months, and more long-term trends as stations are added over time as the network size increases. Here is an animation of the BSS stations near the center of Boston over time:
While it seems intuitive that the number of bike share stations should affect the number of rides in the system, there are two potential issues with including station count as an independent variable: stability of station ids, and circularity of using BSS station counts to predict the number of daily rides. The first is a technical issue and is discussed in the “Gory Details” section below, while the latter could be a serious modeling concern. Since there is no data set of active stations over time, the existence of stations must be inferred from the ride data. In this analysis, if there was a ride to or from a given station within a certain period, that station was assumed to exist and it counted towards the system-wide station count; otherwise it did not. This means that there is a risk of circularity: we infer stations from trips and then try to predict the trips (in part) from the number of stations in the system. The way this risk was mitigated was by using a relatively long period (two weeks) to infer station existence, even though shorter periods yielded a better model fit.
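The station inference described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the analysis: the function name, the tuple layout of the trips, and the choice of a trailing (rather than centered) 14-day window are all assumptions.

```python
from datetime import date, timedelta

def active_station_counts(trips, days, window_days=14):
    """Count stations with at least one trip start or end in the trailing window.

    trips: iterable of (trip_date, start_station, end_station) tuples.
    days: iterable of dates for which to compute the count.
    """
    # Record every date on which each station was touched by a trip.
    seen = {}
    for trip_date, start, end in trips:
        for station in (start, end):
            seen.setdefault(station, set()).add(trip_date)

    counts = {}
    for day in days:
        window_start = day - timedelta(days=window_days - 1)
        # A station counts as existing if any of its trips falls in the window.
        counts[day] = sum(
            1 for dates in seen.values()
            if any(window_start <= d <= day for d in dates)
        )
    return counts

# Hypothetical toy data: station "C" is last used on January 1.
trips = [
    (date(2023, 1, 1), "A", "C"),
    (date(2023, 1, 10), "A", "B"),
    (date(2023, 1, 20), "B", "A"),
]
counts = active_station_counts(trips, [date(2023, 1, 10), date(2023, 1, 20)])
```

A longer window makes the station count respond more slowly to any single day's trips, which is the property relied on to reduce the circularity risk.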
The number of bike stations is an important parameter for predicting the ridership: in all the regressions performed with only linear terms it was the most important feature, i.e., this independent variable had the highest magnitude coefficient. It was also the only parameter that could account for ridership growth over time: if we set aside the COVID year (2020) and the seasonal variation, the BSS trip data exhibits annual growth: the peaks get higher every year starting in 2018. In the present model this growth is driven entirely by the BSS station count parameter.
A functioning bike share system needs to have (a) bicycles and (b) movement of bicycles to the needed stations, sometimes called “rebalancing.” Consider a hypothetical two-station system, where a morning commute direction is from station A to station B. Without movement of bikes or reverse commutes, the maximum number of morning trips is just the number of bikes at station A or the number of available docks at station B: the availability of bikes and docks can be a constraint on the system utilization.
Unfortunately, I am not aware of any source of historical data on the number of bikes and docks available at each point in time at each Boston BSS station: such a source could be used to examine bike and dock availability and estimate the impact of (un)availability on the number of bike trips. It may be possible to model bike/dock unavailability from the trip data itself, but this exercise will be left for the future. In the meantime, with the help of data collected from station bike availability feeds (thanks to Danny Noenickx for the data) we can see that bike/dock unavailability is not a trivial concern. In the month of December 2023, we see stations experiencing shortages (defined as <1 bike or <1 dock at a station) around 18% of the time, meaning that among all station-minutes in the month of December, almost one fifth had a shortage of one kind or another.
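To make the station-minute shortage measure concrete, here is a small sketch of how such a fraction could be computed. The function name and the input layout (one pair of bike and dock counts per station per minute) are assumptions for illustration; the sample values are made up.

```python
def shortage_fraction(observations):
    """Fraction of station-minute observations with no bikes or no docks.

    observations: iterable of (bikes_available, docks_available) pairs,
    one per station per minute.
    """
    observations = list(observations)
    if not observations:
        return 0.0
    # A shortage is fewer than one bike or fewer than one dock.
    short = sum(1 for bikes, docks in observations if bikes < 1 or docks < 1)
    return short / len(observations)

# Hypothetical sample: 5 station-minutes, one empty station and one full one.
sample = [(3, 7), (0, 10), (5, 5), (10, 0), (2, 8)]
fraction = shortage_fraction(sample)
```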
Subway ridership
Another independent variable that improves the model but introduces potential complications is the subway ridership. The source of subway passenger data is the number of paid entries into the Boston subway system; this excludes the Green line street cars where stations are located above ground. Before discussing the specific issues with using the subway data, it may be useful to explicitly state the causal assumptions being made here; these are illustrated in the diagram below where an arrow should be read as “influences” or “affects”:
stateDiagram-v2
direction LR
classDef unobserved fill:white,stroke:black
state "BSS bike/dock availability" as bss_bike_availability
state "BSS size" as bss_size
bss_bike_availability --> bss_ridership
bss_size --> bss_ridership
state "BSS cost" as bss_cost
state "Bike infrastructure" as bike_infra
bss_cost --> bss_ridership
state "Day of week/holiday" as wwh
state "local/national events" as other_factors
state "Transportation demand" as transpo_demand
state "Subway ridership" as subway_ridership
Weather --> subway_ridership
state "BSS ridership" as bss_ridership
Weather --> bss_ridership
subway_ridership --> bss_ridership
wwh --> transpo_demand
other_factors --> transpo_demand
transpo_demand --> subway_ridership
state "Subway performance" as subway_performance
subway_performance --> subway_ridership
subway_performance --> bss_ridership
state "Population size" as population_size
state "Population demographics" as population_demographics
population_size --> bss_ridership
population_size --> subway_ridership
population_demographics --> bss_ridership
population_demographics --> subway_ridership
state "Other modes' performance" as other_transpo_modes_performance
bike_infra --> bss_ridership
other_transpo_modes_performance --> bss_ridership
other_transpo_modes_performance --> subway_ridership
transpo_demand --> bss_ridership
class bss_ridership
class bss_cost unobserved
class bike_infra unobserved
class transpo_demand unobserved
class subway_performance unobserved
class other_factors unobserved
class other_transpo_modes_performance unobserved
class bss_bike_availability unobserved
The blue blocks in the above diagram are data that were included or considered for inclusion in the model, while the white blocks are those for which I did not have easy access to suitable data.
Looking at the above diagram we can note that there are very few factors that affect BSS ridership that do not also affect the subway ridership under the present assumptions. Furthermore, many parameters that affect only BSS ridership (such as the amount of bike infrastructure in the area, historical bike [un]availability at docks, BSS costs, etc.) are not easy to obtain as historical data sets. The only tractable variable that affects BSS ridership to the exclusion of subway passenger numbers is the BSS station counts data. Given all this, if we see a correlation between the number of subway passengers and the number of BSS rides we don’t necessarily know how to interpret this correlation; it could be due to subway utilization truly affecting BSS ridership (for instance when commuters use both subway and BSS bikes) or it could be the case that subway passenger numbers are just a proxy for other unseen factors that affect both subway and BSS utilization, for instance, transportation demand.
Consider transportation usage during rush hour versus late evenings or early mornings, for example: during rush hour there is a lot of demand for transportation, and both subway and bike share systems are heavily utilized. In off hours, both are less utilized. In this example, there is a correlation between subway and BSS usage, but the relationship is not causal.
In fact, the behavior of the weekday/weekend distinction in our regressions is evidence that the subway data is carrying a signal about transportation demand. Boston BSS sees more trips on weekdays than weekends. Our model reflects this if we exclude the subway entry data and run the regression: in this case, the “weekday” parameter has a large and positive coefficient, while weekend days have the equivalent of a zero coefficient. On the other hand, if we include subway passenger counts as an input to the regression, the weekday parameter has a negative coefficient, which, taken alone, would suggest that there are fewer BSS trips on weekdays, which is not correct. What is correct is that…
- There are a lot more subway passengers on weekdays than on weekends
- The correlation between the number of passengers on the subway and the number of BSS rides is more strongly positive on weekdays than on weekends.
Combined, these two points result in a model with a negative weekday coefficient: the subway ridership parameter “overgenerates” BSS rides on weekdays and the weekday parameter corrects that.
Returning to the subway data, there seems to be a clear correlation between the number of riders on the Boston subway system and the number of bike share trips taken, and including the subway data in the regression improves the model fit substantially; however, how much of this correlation is due to the synergies between the subway and BSS transportation modes and how much is simply a reflection of transportation demand isn’t clear without more analysis. In the future I hope to look at other factors such as traffic counts and subway performance in an attempt to disambiguate the meaning of this correlation.
In the meantime, in much of what comes below, the discussion is based on the model without subway ridership data included, because the subway ridership data, while improving the model fit, can obscure the relationships between other model parameters.
Weather and Calendar data
Temperature
Temperature is the most important weather factor for determining daily BSS ridership, and is the second most important overall after BSS station count in most of the regressions attempted. Since temperature varies with seasons and so does the number of stations (stations are added in the warmer months and are removed in the colder months) we might be concerned that the temperature independent variable and the number of stations independent variable are highly correlated. And indeed they are, but only if we include the age of the system in computing the strength of the correlation. For more information see “Stations and Geography” in the “Gory Details” section.
In terms of temperature itself, if we do not restrict the number of BSS trips model to only linear terms, adding higher-order terms for temperature results in a better fit, but this is not universal: a regression with a quadratic temperature term does basically no better than a linear equation in terms of fitting the ridership data, and a 4th degree equation performs no better than a 3rd degree equation. On the other hand there is a large difference in 2nd and 3rd degree fits between temperature and ridership. The following chart shows different degrees of fitted functions for the temperature/ridership correlation:
From the chart above we can see that through the middle of the temperature range, a linear function would provide a good fit for the data: all four equations have a nearly linear region between +5 and +20 degrees C. Where the 3rd and 4th degree functions differ from the others is closer to the extremes of the temperature range: there a less steep slope seems to be preferred, something that the 1st or 2nd degree functions are unable to model. This suggests that when it is particularly hot or cold each additional degree of temperature is associated with fewer extra rides taken/not taken than in the middle of the temperature range. The reason 4th and 3rd degree equations of temperature perform similarly even though the tails look quite different is probably because there are very few data points at the extremes of the ranges: only 78 days (out of more than 3000) had temperatures under 5°C or over 35°C. In the middle of the temperature range, where the demand for BSS seems to be quite elastic, each additional degree of temperature (F) generates between 100 and 150 rides, depending on whether we use a linear or a cubic equation to model the temperature.
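The comparison of polynomial degrees can be reproduced in outline with a few lines of NumPy. This is only a sketch on synthetic data (an S-shaped response that flattens at both extremes, chosen to mimic the qualitative shape described above); the function names and the generated numbers are assumptions and are not the Bluebikes data.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination for predictions y_hat of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def poly_fit_scores(temps, rides, degrees=(1, 2, 3, 4)):
    """Fit a polynomial of each degree and return {degree: R-squared}."""
    scores = {}
    for degree in degrees:
        coeffs = np.polyfit(temps, rides, degree)
        scores[degree] = r_squared(rides, np.polyval(coeffs, temps))
    return scores

# Synthetic data, NOT the Bluebikes data: ridership rises with temperature
# but flattens at both the cold and the hot end, plus noise.
rng = np.random.default_rng(0)
temps = rng.uniform(-10, 35, size=500)
rides = 5000 * np.tanh((temps - 12) / 12) + 6000 + rng.normal(0, 300, size=500)
scores = poly_fit_scores(temps, rides)
```

On data shaped like this, the jump in fit tends to come between the 2nd and 3rd degree, because an odd-degree term is needed to flatten both tails at once.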
Unlike other weather features, temperature does not show a significant weekday versus weekend or holiday split: the difference in coefficient magnitudes is under 8%, which I don’t take to be meaningful. The lack of weekday/weekend distinction for temperature correlation is likely due to the fact that unlike other weather factors (precipitation, sky cover/sunshine) the day-to-day variation in temperature is relatively small.
Precipitation
Precipitation amount is the second most important weather parameter in all the regressions. The importance of this parameter is about 1/4 to 1/3 that of temperature, depending on what other parameters are included in the model. Rain exhibits more of a weekday/weekend split than temperature, with the magnitude of the negative coefficient increasing on weekends by between 15% and 20%, depending on whether subway ridership data is included in the model. This would be expected if we assume that a larger share of the trips on weekends are recreational: “necessary” trips (such as commuting) are expected to be less impacted by inclement weather.
Cloud cover
Cloud cover is the third most important weather parameter for determining BSS ridership. In regressions with or without subway ridership it carries about 4/5 the importance of precipitation. Like precipitation, the coefficient associated with cloud cover has a larger magnitude on weekends if we split the data along weekday/weekend lines: the weekend sees about a 20-25% increase in ridership sensitivity to cloud cover.
Wind
Wind seems to be the least important weather parameter for predicting the number of BSS trips: in linear regressions it had 2/3 the importance of the sky cover parameter. This is a bit surprising in such a windy city. Also surprising was the fact that wind showed a reverse weekday/weekend difference from other weather factors: the negative correlation between wind speed and BSS ridership was 20-30% weaker on weekend days than on weekdays. This difference requires further study, as I don’t have any explanation for this.
Other Weather Parameters
Other weather parameters did not turn out to be significant. Relative humidity was not significant in the presence of the cloud cover independent variable. Both the dew point and absolute humidity exhibited a strong correlation with the temperature and were therefore not considered as inputs to the regression.
Calendrical data
One of the most important calendar-related features is the weekday/weekend and holiday distinction. The weekday and weekend BSS ridership patterns look different in some respects, though not necessarily in the number of trips: the number of trips per day on weekdays in the data set is only 12% larger than that on the weekends and holidays. The trip distances (mean or median, calculated as the crow flies) are not substantially different either: the median trip distance for both is about 1 mile. On the other hand, trip times on weekdays and weekends are not the same: the median trip on a weekend is almost 30% longer, suggesting that BSS customers on weekends pedal slower or take longer routes between equally distant stations than on weekdays. As we might have guessed, there is a lot more variation in the weekend ride durations: the weekend trip duration standard deviation is 30% larger.
Hourly trip numbers for weekdays and weekends likewise look different: weekdays show bimodal distribution consistent with commuting patterns, while the weekend days see one major peak in the late afternoon:
We have seen above that many of the weather parameters, such as precipitation and cloud cover, have a stronger effect on weekend and holiday ridership than on weekday ridership. The reverse is true for subway ridership: the correlation between the number of subway passengers and the number of BSS riders is about 50% larger on weekdays. This implies that the bike share system experiences more recreational demand than the subway system, and it is true: the Boston subway sees twice as many passengers on weekday days as on weekend and holiday days, while we have seen above that BSS ridership in Boston only increases by about 12% on weekdays.
We can also observe here that there are more same-station rides on weekends, where a same-station ride is defined as one that starts and ends at the same station. As noted in Blue Bikes Boston visualization, such trips are more likely to be leisure trips than trips that start and end in different stations. In our data set, there were twice as many same-station rides on weekends as there were on weekdays.
In the foregoing, the holidays were grouped with weekends, but those features can be distinct. Two further calendar-related features are the proximity to the closest preceding and the closest following holidays; on December 28th these would be 3 and 4, because December 28th is three days away from Christmas, the preceding holiday, and four days away from New Year’s Day, the following holiday. It should be pointed out that in the presence of subway passenger counts, calendar variables were either not significant or did not contribute much to the model fit, either using the R-squared metric or RMSE. In the absence of subway passenger data, the weekday coefficient is positive, as expected, and other calendar features become significant. The “holiday” categorical feature, with a large negative coefficient, becomes the most important one, even more important than the weekend/weekday distinction: in this model it had twice the magnitude of the weekday feature. Number of days to or from a holiday carried the smallest coefficient of all the features, and also contributed little or nothing (0.000 or 0.002 respectively) to the R-squared fit metric. Categorical columns for individual calendar days only added 0.001 to the fit metric and therefore this analysis uses a single weekday feature.
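The two holiday-proximity features can be computed along the following lines. This is an illustrative sketch: the function name is hypothetical, the two-entry holiday list is a stand-in for a full holiday calendar, and how a date that is itself a holiday should be treated is an assumption (here it gets a distance of zero in both directions).

```python
from datetime import date

def holiday_distances(day, holidays):
    """Return (days since the closest preceding holiday,
    days until the closest following holiday)."""
    before = [h for h in holidays if h <= day]
    after = [h for h in holidays if h >= day]
    days_since = (day - max(before)).days if before else None
    days_until = (min(after) - day).days if after else None
    return days_since, days_until

# Stand-in calendar with just two holidays: Juneteenth and Independence Day.
holidays = [date(2023, 6, 19), date(2023, 7, 4)]
distances = holiday_distances(date(2023, 7, 1), holidays)
```

For July 1, 2023 this gives 12 days since the preceding holiday and 3 days until the following one.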
Population
Given that population and demographics may affect bike and BSS ridership (for instance it could be the case that older and younger people are less likely to make trips by bicycle if the bike infrastructure is lacking) it may make sense to include certain population and demographic data in our model. On the other hand, this type of data does not change rapidly, so we might not expect this data to account for much of the changes in ridership numbers. In one attempt to use the population count and median age (from the 2020 Census and the American Community Survey 1-year data) in the regression with non-linear (square and cubic) terms, the R-squared fit metric improved by 0.009. The population sex (male percentage) parameter was not found to be significant. When the same parameters (population and median age) were used in the linear version of the model, the sign of the coefficients was reversed relative to the non-linear model. I do not have an explanation for this, but I took that to mean that population data was not particularly helpful to the model, and as a result the population and demographic data was not used.
Modeling
Using the independent variables discussed above we can construct models that vary along two major dimensions: whether subway data is included, and whether the model includes non-linear parameters. The model with subway data and non-linear parameters shows the best fit to the data, but the other variants are useful for the following reasons:
- Using linear models we can compare the magnitude of the coefficients of the parameters to each other to see which of the independent variables are more important to predicting the outcome, the number of daily rides in the bike sharing system.
- Removing the number of subway passengers as an independent variable de-obfuscates the calendar and possibly other independent variables, because the subway ridership itself is not an entirely independent variable.
The goodness of the fit, measured by the R-squared metric, is as follows:
| | Non-linear | Linear |
|---|---|---|
| With subway | 0.870 | 0.811 |
| Without subway | 0.827 | 0.772 |
The correspondence between actual ridership and predicted numbers from the non-linear model that includes the subway data is as follows:
I hope to explore some of the mismatches between actual and predicted data as well as some COVID-related findings from the model in a future analysis.
Conclusions
This analysis attempted to understand the factors that influence bike share system utilization in Boston by constructing a model of daily ridership. The results suggest that temperature and precipitation have the most effect on ridership, with the latter having a more pronounced effect on weekends and holidays than on weekdays. We have seen that BSS riders’ sensitivity to temperature is linear in the mid-temperature range, but is more attenuated for higher and lower temperatures. Cloud cover was also found to be an important parameter in predicting the number of trips in the Boston bike share system, while wind and other weather parameters play a much lesser role.
Gory Details
BSS Trips
Historical Boston BSS bike trip data is available from bluebikes.com/system-data. The station status and station information real-time feeds were not used in the model regression; links to these real-time feeds are available from the Bluebikes system data page. The trips data set included trips from 2015 through the end of 2023. This data was cleaned to remove trips lacking starting or ending geo coordinates:
- There were 68 trips with missing (zeroed out) geo coordinates
- There were 17,167 trips that were missing the ride end station and the ride end geo coordinates, all but two of them from April 2023 or later. This coincides with the date when the system was switched to a new station id type.
This left 20,404,177 trips in the data set, prior to station filtering.
Stations and Geography
The trip data for Boston BSS includes station identifiers, but these are not entirely reliable. First, the same station id sometimes refers to positions very far from each other. Second, the format of the station ids in the BSS trips data set changed on April 1, 2023, and therefore there is a discontinuity in station identifiers in the data set. Because of this, I inferred stations from trip coordinates by clustering trip start and end geo coordinates that were within 50 meters of each other into an inferred station.
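One simple way to cluster coordinates within 50 meters of each other is a greedy pass that assigns each point to the first existing cluster seed within range. The sketch below is an assumption about how this could be done, not the clustering method actually used in the analysis; the function names and sample coordinates are hypothetical, and the result of a greedy pass can depend on the order of the points.

```python
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def infer_stations(points, radius_m=50):
    """Greedily assign each point to the first inferred station within radius_m.

    Returns (stations, labels): station seed coordinates and, for each input
    point, the index of the station it was assigned to.
    """
    stations, labels = [], []
    for point in points:
        for idx, seed in enumerate(stations):
            if haversine_m(point, seed) <= radius_m:
                labels.append(idx)
                break
        else:
            # No existing station is close enough: start a new one here.
            stations.append(point)
            labels.append(len(stations) - 1)
    return stations, labels

points = [
    (42.36010, -71.05890),
    (42.36015, -71.05900),  # about 10 m from the first point
    (42.37000, -71.06000),  # about 1.1 km away
]
stations, labels = infer_stations(points)
```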
I excluded two types of stations from the analysis: one set in Salem/Swampscott area and one (station) in Hingham. This is the map of included versus excluded stations; the green dots represent stations included in the analysis while the red ones mark the excluded stations:
The Salem area BSS stations are largely disconnected from the Boston system: out of the 27,348 trips that touched the Salem/Swampscott area in the analyzed data set, only 259 (less than 1%) did not both start and end in the area. In other words, more than 99% of the trips in the Salem/Swampscott area remained in that area, without connecting to the rest of the system.
The station in Hingham is at the office of the current sponsor of the Boston bike share system, Blue Cross Blue Shield. Like the Salem/Swampscott stations, it is also quite far away from the rest of the Boston BSS stations. All but one of the trips involving the Hingham station started and ended at the Hingham station.
The number of stations in the system actually correlates pretty well with age of the system, and even better if we add the temperature to account for seasonal variation in the number of stations deployed. Nonetheless, even a simple system age correlation accounts for over 90% of the variation in the number of stations:
The system age was not used when regressing the models in the present analysis.
Weather
The weather data came from METAR source for the Boston Logan airport. The data used included hourly observations. Any hourly data points that contained bad or uninterpretable data were ignored. What that means for the data depends on the particular weather parameter, as the treatment of the missing data ends up being dependent on the daily aggregation, which is detailed below.
The source data did not include absolute humidity, therefore absolute humidity was calculated from relative humidity and temperature according to this formula. Cloud coverage was coded as “clear,” “few,” “scattered,” “broken,” or “overcast” for three different cloud levels in the source data set. Additionally the data set contained the code "0" in the higher two cloud layers, which I took to mean “not possible to determine.” The cloud coverage codes were translated into a numeric sky coverage score as follows:
| Code | Numeric value |
|---|---|
| 0 | 0 |
| CLR | 0 |
| FEW | 1.5 |
| SCT | 3.5 |
| BKN | 6 |
| OVC | 8 |
Using this table, cloud coverage at each of the three levels was translated into a numeric code. An overall cloud coverage for that time point was computed by taking the maximum of all levels. For example, if the weather observation reported “Clear” at level one, “Few” at level 2, and “Overcast” at level three, the overall cloud coverage score would have been 8.
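The table lookup and the maximum across layers amount to very little code. A minimal sketch, with a hypothetical function name and the scores taken from the table above:

```python
# Numeric sky cover score for each cloud coverage code, as in the table above.
SKY_SCORE = {"0": 0, "CLR": 0, "FEW": 1.5, "SCT": 3.5, "BKN": 6, "OVC": 8}

def sky_cover_score(layer_codes):
    """Overall sky cover score: the maximum score across cloud layers."""
    return max(SKY_SCORE[code] for code in layer_codes)

# The example from the text: clear, few, and overcast at the three levels.
score = sky_cover_score(["CLR", "FEW", "OVC"])
```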
The hourly weather data was aggregated into a single daily value as follows:
- Precipitation (mm/hour): sum of hourly values
- Temperature (°C): mean of hourly values
- Wind speed (knots): mean of hourly values
- Relative humidity: maximum of hourly values
- Absolute humidity: mean of hourly values
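The aggregation rules above can be sketched as a mapping from parameter to aggregation function. The column names, the row layout, and the use of None to mark a bad or uninterpretable reading are assumptions for illustration, and the sample rows are made up.

```python
from collections import defaultdict
from statistics import mean

# Aggregation rule per weather parameter, following the list above.
AGGREGATORS = {
    "precip_mm": sum,
    "temp_c": mean,
    "wind_kt": mean,
    "rel_humidity": max,
    "abs_humidity": mean,
}

def aggregate_daily(hourly_rows):
    """Collapse hourly observations into one record per day.

    hourly_rows: iterable of dicts with a "day" key plus parameter values;
    a value of None marks a bad or missing reading and is skipped.
    """
    by_day = defaultdict(lambda: defaultdict(list))
    for row in hourly_rows:
        for name in AGGREGATORS:
            value = row.get(name)
            if value is not None:
                by_day[row["day"]][name].append(value)
    return {
        day: {name: AGGREGATORS[name](values[name])
              for name in AGGREGATORS if values[name]}
        for day, values in by_day.items()
    }

rows = [
    {"day": "2023-07-01", "precip_mm": 0.0, "temp_c": 20.0, "wind_kt": 8,
     "rel_humidity": 60, "abs_humidity": 10.0},
    {"day": "2023-07-01", "precip_mm": 1.5, "temp_c": 24.0, "wind_kt": None,
     "rel_humidity": 80, "abs_humidity": 12.0},
]
daily = aggregate_daily(rows)
```

Skipping a missing value affects each rule differently: a dropped hour lowers a sum, may or may not change a maximum, and shifts a mean only as far as that hour differed from the rest.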
Subway Ridership
The Boston subway station entry counts came from the MBTA. For this analysis, gated entries from all subway stations were summed together, without regard to the station location.
Population
The population counts and demographic data came from the American Community Survey 1-year Population Estimates, and the 2020 Census Demographic and Housing Characteristics files. Since the ACS 1-year data does not include geographic areas that encompass all and only the geographic zones that have (or have had) BSS stations, I ended up using population data from four County Subdivisions, which are Boston, Cambridge, Somerville, and Newton. This left Brookline, Chelsea, Malden, Medford, Revere, and Watertown out of the population and demographic counts. Quincy, a city for which demographic data is available, was excluded intentionally as it has only one BSS station at its northeastern extremity. This is a map of areas which were included in population analysis with BSS stations overlaid on top:
I also attempted to model a hypothetical student population for the analysis. According to the Boston City Student Housing Trends report, in 2018-2019 academic year there were 94,448 undergraduate students in Boston. I modeled the student departure/arrival by assuming that a) the number of undergraduate students stayed constant between 2015 and 2023, and that b) half of the undergraduate students left Boston before June 1st and arrived back in Boston in the first week of September. Adding student population did not significantly affect the model fit, therefore this data was also left out of the model. Broader population counts were also left out for reasons detailed earlier.
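The student population model amounts to a step function over the calendar year. In the sketch below the function name is hypothetical and the single return date of September 7 is an assumption standing in for "the first week of September".

```python
from datetime import date

UNDERGRADUATES = 94_448  # 2018-2019 figure cited above, held constant

def students_present(day, return_day_in_september=7):
    """Estimated undergraduates in Boston on a given date.

    Half of the students are assumed absent from June 1 until their
    return in the first week of September (a single assumed return day).
    """
    away_start = date(day.year, 6, 1)
    back = date(day.year, 9, return_day_in_september)
    if away_start <= day < back:
        return UNDERGRADUATES // 2
    return UNDERGRADUATES

summer = students_present(date(2019, 7, 15))
autumn = students_present(date(2019, 10, 1))
```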
Causality
A note on the causality diagram in the “Subway ridership” section: Technically there should also probably be a causal arrow from “BSS ridership” to “subway ridership”; however, from the standpoint of accounting for the factors governing Boston bike share system utilization, I take this causal path to be negligible: the number of daily subway passengers is about 10 times the number of Boston BSS daily riders on any given day, therefore we do not expect BSS ridership to affect subway ridership enough to affect BSS ridership again.
Regression
The models were regressed via ordinary least squares (OLS) or generalized least squares (GLS) functions using the statsmodels Python library; it didn’t matter which regression method was used. Prior to regression, all independent variables were standardized, unless I was trying to compute the exact effect of a small change in the independent variable on the number of rides. Lack of standardization did not affect the model fit.
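The effect of standardization on coefficients can be illustrated with a small least-squares example. To keep the sketch dependency-light it uses NumPy's least-squares solver as a stand-in for the statsmodels OLS routine; the data is synthetic and the variable names and generating coefficients are assumptions, not results from this analysis.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

# Synthetic data, not the Bluebikes data set.
rng = np.random.default_rng(42)
n = 1000
temperature = rng.normal(12, 9, n)     # degrees C
precipitation = rng.exponential(3, n)  # mm per day
rides = 4000 + 120 * temperature - 60 * precipitation + rng.normal(0, 200, n)

X = np.column_stack([temperature, precipitation])
_, raw_coefs = fit_ols(X, rides)               # rides per unit of each feature
_, std_coefs = fit_ols(standardize(X), rides)  # rides per standard deviation
```

The raw coefficients answer "how many rides per degree or per millimeter", while the standardized ones are comparable across features, which is what makes the importance rankings in the main text meaningful. The fitted values, and therefore R-squared, are the same either way.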
Acknowledgements
Thanks to Wes Drew, Sam Gerstein, Stacey Moisuk, Jackson Moore-Otto, and Sabina Wolfson for helpful comments and suggestions. Thanks to Danny Noenickx for providing a copy of the stations data and status feeds.
References
T. Thomas, R. Jaarsma, B. Tutert, 2013. Exploring temporal fluctuations of daily cycling demand on Dutch cycle paths: The influence of weather on cycling. Transportation, 40.
S. Pazdan, M. Kiec, C. D’Agostino, 2021, Impact of environment on bicycle travel demand—Assessment using bikeshare system data. Sustainable Cities and Society 67.