In this article, we discuss how to segment and get the time between two dates in Python.
How to segment and get the time between two dates?
Solution 1
I have been thinking about getting all the data and solving the problem with pandas.
TLDR: Generate a range of minutes per trip, explode those minutes into rows, and resample those rows into hours to count the minutes per hour:
import pandas as pd
df = pd.read_sql(...)
# convert to datetime dtype if not already
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# fill missing end dates
current_time = pd.Timestamp('2022-03-10 04:00:00') # or pd.Timestamp.now()
df['end_date'] = df['end_date'].fillna(current_time)
# generate range of minutes per trip
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
(df[['id', 'init_date']].explode('init_date') # explode minutes into rows
.set_index('init_date')['id'].resample('H').count() # count rows (minutes) per hour
.mul(60).reset_index(name='seconds')) # convert minutes to seconds
Output:
init_date seconds
2022-03-10 01:00:00 720
2022-03-10 02:00:00 4200
2022-03-10 03:00:00 5460
2022-03-10 04:00:00 0
2022-03-10 05:00:00 1080
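To make the TLDR reproducible, here is a self-contained sketch that builds the sample trips inline (data assumed from the question) instead of reading from SQL; it assumes pandas ≥ 1.4 for inclusive='left':

```python
import pandas as pd

# sample trips from the question, constructed inline instead of pd.read_sql
df = pd.DataFrame({
    "id": range(1, 9),
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30",
    ]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48",  # trip 7 has no end date
    ]),
})

# fill the open-ended trip with the "current" time
df["end_date"] = df["end_date"].fillna(pd.Timestamp("2022-03-10 04:00:00"))

# one row per minute, counted per hour, converted to seconds
df["init_date"] = df.apply(
    lambda x: pd.date_range(x["start_date"], x["end_date"],
                            freq="min", inclusive="left"),
    axis=1,
)
out = (df[["id", "init_date"]].explode("init_date")
         .set_index("init_date")["id"].resample("h").count()
         .mul(60).reset_index(name="seconds"))
print(out)
```

Running this reproduces the output shown above (720, 4200, 5460, 0, 1080 seconds for the hours from 01:00 to 05:00).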
Step-by-step breakdown
- Generate a date_range of minutes from start_date to end_date per trip:

df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
# id number_of_trip ... init_date
# 1  637hui         ... DatetimeIndex(['2022-03-10 01:20:00', '2022-03-10 01:21:00', ..., '2022-03-10 01:31:00'])
# 2  384nfj         ... DatetimeIndex(['2022-03-10 02:18:00', '2022-03-10 02:19:00', ..., '2022-03-10 02:41:00'])
# 3  102fiu         ... DatetimeIndex(['2022-03-10 02:10:00', '2022-03-10 02:11:00', ..., '2022-03-10 02:22:00'])
# 4  948pvc         ... DatetimeIndex(['2022-03-10 02:40:00', '2022-03-10 02:41:00', ..., '2022-03-10 03:19:00'])
# 5  473mds         ... DatetimeIndex(['2022-03-10 02:45:00', '2022-03-10 02:46:00', ..., '2022-03-10 02:57:00'])
# 6  103fkd         ... DatetimeIndex(['2022-03-10 03:05:00', '2022-03-10 03:06:00', ..., '2022-03-10 03:27:00'])
# 7  905783         ... DatetimeIndex(['2022-03-10 03:12:00', '2022-03-10 03:13:00', ..., '2022-03-10 03:59:00'])
# 8  498wsq         ... DatetimeIndex(['2022-03-10 05:30:00', '2022-03-10 05:31:00', ..., '2022-03-10 05:47:00'])
- explode the minutes into rows:

exploded = df[['init_date', 'id']].explode('init_date').set_index('init_date')['id']
# init_date
# 2022-03-10 01:20:00    1
# 2022-03-10 01:21:00    1
# 2022-03-10 01:22:00    1
# ..
# 2022-03-10 05:45:00    8
# 2022-03-10 05:46:00    8
# 2022-03-10 05:47:00    8
# Name: id, Length: 191, dtype: int64
- resample the rows into hours to count the minutes per hour (× 60 to convert to seconds):

out = exploded.resample('H').count().mul(60).reset_index(name='seconds')
# init_date            seconds
# 2022-03-10 01:00:00  720
# 2022-03-10 02:00:00  4200
# 2022-03-10 03:00:00  5460
# 2022-03-10 04:00:00  0
# 2022-03-10 05:00:00  1080
Driver IDs
If I have a column with the driver id, how do I get a segmentation by hours and by driver id without reprocessing?
In this case, just change resample to groupby.resample. Select driver_id before exploding, and group by driver_id before resampling.
As a minimal example, I duplicated the sample data to create two driver_id groups, a and b:
# after preprocessing and creating init_date ...
(df[['driver_id', 'init_date']] # now include driver_id
.explode('init_date').set_index('init_date') # explode minutes into rows
.groupby('driver_id').resample('H').count() # count rows (minutes) per hour per driver_id
.mul(60).rename(columns={'driver_id': 'seconds'})) # convert minutes to seconds
# seconds
# driver_id init_date
# a 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
# b 2022-03-10 01:00:00 720
# 2022-03-10 02:00:00 4200
# 2022-03-10 03:00:00 5460
# 2022-03-10 04:00:00 0
# 2022-03-10 05:00:00 1080
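A self-contained sketch of this grouped variant, duplicating the sample trips under two hypothetical driver ids a and b (data assumed from the question):

```python
import pandas as pd

# sample trips from the question, constructed inline
trips = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30",
    ]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48",
    ]),
})

# duplicate the sample under two hypothetical driver ids
df = pd.concat([trips.assign(driver_id="a"), trips.assign(driver_id="b")],
               ignore_index=True)
df["end_date"] = df["end_date"].fillna(pd.Timestamp("2022-03-10 04:00:00"))
df["init_date"] = df.apply(
    lambda x: pd.date_range(x["start_date"], x["end_date"],
                            freq="min", inclusive="left"),
    axis=1,
)

# same pipeline, but grouped by driver before resampling
out = (df[["driver_id", "init_date"]]
         .explode("init_date").set_index("init_date")
         .groupby("driver_id").resample("h").count()
         .mul(60).rename(columns={"driver_id": "seconds"}))
print(out)
```

Each driver then gets its own hourly breakdown, matching the single-driver totals shown above.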
Original author of this content: tdy
Solution 2
This can be done in plain SQL (apart from the time_bucket function), in a nested SQL query:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
) subq
group by interval_start
order by interval_start;
This gives me the following result:
interval_start | seconds
---------------------+---------
2022-03-10 01:00:00 | 720
2022-03-10 02:00:00 | 4200
2022-03-10 03:00:00 | 5460
2022-03-10 04:00:00 | 3600
2022-03-10 05:00:00 | 4680
2022-03-10 06:00:00 | 0
(6 rows)
Explanation
Let’s break the query down.
In the innermost query:
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
we generate a series of time interval starts – from the minimal start_date value up to the maximal end_date value, truncated to full hours, with a 1-hour step. Each boundary can obviously be replaced with an arbitrary datetime. The direct result of this query is the following:
interval_start
---------------------
2022-03-10 01:00:00
2022-03-10 02:00:00
2022-03-10 03:00:00
2022-03-10 04:00:00
2022-03-10 05:00:00
2022-03-10 06:00:00
(6 rows)
Then, the middle-level query joins this series with the trips
table, joining rows if and only if any part of the trip took place during the hour-long interval beginning at the time given by the ‘interval_start’ column:
select interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
-- innermost query
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour'
) as interval_start
-- innermost query end
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour' and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
The two computed values represent, respectively:
- seconds_before_trip_started – the number of seconds between the beginning of the interval and the beginning of the trip (or 0 if the trip began prior to the interval start). This is time during which the trip did not take place, so we subtract it in the following step.
- seconds_before_trip_ended – the number of seconds between the beginning of the interval and the end of the trip (capped at 3600 if the trip didn't end within the concerned interval).
The outermost query subtracts the two aforementioned fields, effectively computing the time each trip took in each interval, and sums it over all trips, grouping by interval:
select
interval_start,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
-- middle-level query
select
interval_start,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start) i
join trips t
on t.start_date <= i.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
-- middle-level query end
) subq
group by interval_start
order by interval_start;
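The clamp-and-subtract arithmetic is easier to trace outside the database; below is a rough Python sketch of the same formula, with the sample trips and the 06:00 fill-in value taken from the query above:

```python
from datetime import datetime, timedelta

FILL = datetime(2022, 3, 10, 6, 0)  # stand-in for a missing end_date
trips = [
    (datetime(2022, 3, 10, 1, 20), datetime(2022, 3, 10, 1, 32)),
    (datetime(2022, 3, 10, 2, 18), datetime(2022, 3, 10, 2, 42)),
    (datetime(2022, 3, 10, 2, 10), datetime(2022, 3, 10, 2, 23)),
    (datetime(2022, 3, 10, 2, 40), datetime(2022, 3, 10, 3, 20)),
    (datetime(2022, 3, 10, 2, 45), datetime(2022, 3, 10, 2, 58)),
    (datetime(2022, 3, 10, 3, 5),  datetime(2022, 3, 10, 3, 28)),
    (datetime(2022, 3, 10, 3, 12), None),
    (datetime(2022, 3, 10, 5, 30), datetime(2022, 3, 10, 5, 48)),
]

def seconds_per_hour(trips, fill=FILL):
    """Mirror the SQL: per hourly interval, sum the clamped per-trip overlap."""
    filled = [(s, e or fill) for s, e in trips]
    first = min(s for s, _ in filled).replace(minute=0, second=0)
    last = max(e for _, e in filled).replace(minute=0, second=0)
    out = {}
    t = first
    while t <= last:
        total = 0
        for start, end in filled:
            # mirror the join condition: trip overlaps [t, t + 1 hour]
            if start <= t + timedelta(hours=1) and end >= t:
                before_start = max(0, int((start - t).total_seconds()))
                before_end = min(3600, int((end - t).total_seconds()))
                total += before_end - before_start
        out[t] = total
        t += timedelta(hours=1)
    return out

print(seconds_per_hour(trips))
```

This reproduces the query's result, including the 3600 s for the 04:00 hour contributed entirely by the open-ended trip.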
Additional grouping
In case we have another column in the table, and what we really need is the segmentation of the above result in respect to that column, we simply need to add it to the appropriate select
and group by
clauses (optionally to order by
clause as well).
Suppose there’s an additional driver_id
column in the trips
table:
id | number_of_trip | start_date | end_date | seconds | driver_id
----+----------------+---------------------+---------------------+---------+-----------
1 | 637hui | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 | 720 | 0
2 | 384nfj | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 | 1440 | 0
3 | 102fiu | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 | 780 | 1
4 | 948pvc | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 | 2400 | 1
5 | 473mds | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 | 780 | 1
6 | 103fkd | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 | 1380 | 2
7 | 905783 | 2022-03-10 03:12:00 | | 0 | 2
8 | 498wsq | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 | 1080 | 2
The modified query would look like that:
select
interval_start,
driver_id,
sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
select
interval_start,
driver_id,
greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
select generate_series(
(select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
(select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
'1 hour') as interval_start
) intervals
join trips t
on t.start_date <= intervals.interval_start + interval '1 hour'
and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
) subq
group by interval_start, driver_id
order by interval_start, driver_id;
and give the following result:
interval_start | driver_id | seconds
---------------------+-----------+---------
2022-03-10 01:00:00 | 0 | 720
2022-03-10 02:00:00 | 0 | 1440
2022-03-10 02:00:00 | 1 | 2760
2022-03-10 03:00:00 | 1 | 1200
2022-03-10 03:00:00 | 2 | 4260
2022-03-10 04:00:00 | 2 | 3600
2022-03-10 05:00:00 | 2 | 4680
2022-03-10 06:00:00 | 2 | 0
Original author of this content: tyrrr
Solution 3
Here is what works in SQLite (it can be tested):
CREATE TABLE trips(
id INT PRIMARY KEY NOT NULL,
start_date TIMESTAMP,
end_date TIMESTAMP,
seconds INT
);
INSERT INTO trips(id, start_date, end_date, seconds) VALUES
(1, '2022-03-10 01:20:00', '2022-03-10 01:32:00', 720),
(2, '2022-03-10 02:18:00', '2022-03-10 02:42:00', 1440),
(3, '2022-03-10 02:10:00', '2022-03-10 02:23:00', 780),
(4, '2022-03-10 02:40:00', '2022-03-10 03:20:00', 2400),
(5, '2022-03-10 02:45:00', '2022-03-10 02:58:00', 780),
(6, '2022-03-10 03:05:00', '2022-03-10 03:28:00', 1380),
(7, '2022-03-10 03:12:00', NULL, 0),
(8, '2022-03-10 05:30:00', '2022-03-10 05:48:00', 1080);
WITH
checked AS (SELECT '2022-03-10 03:00:00' AS start, '2022-03-10 04:00:00' AS end)
SELECT
SUM(
IIF(end_date IS NULL, ROUND(MAX(0, (JULIANDAY(checked.end) - JULIANDAY(start_date)) * 24 * 60 * 60)),
MAX(
0,
(JULIANDAY(MIN(checked.end, end_date)) - JULIANDAY(MAX(checked.start, start_date))) /
(JULIANDAY(end_date) - JULIANDAY(start_date)) * seconds
)
)
)
FROM trips, checked;
DROP TABLE trips;
The code is simplified and SQLite lacks some features, but I think it will be easy to adapt 🙂
Briefly, the algorithm is:
- If end_date IS NULL, then:
- Calculate the number of seconds from the start of the trip to the end of the interval
- Throw away negative values
- Otherwise:
- Calculate what part of trip in seconds we need within one interval
- Throw away negative values
- Sum the values
This can be done for any interval with a start and end.
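As a minimal illustration of that algorithm, here is a Python sketch for a single checked interval (the 03:00–04:00 window from the query; only the trips that can overlap it are listed, with data assumed from the sample table):

```python
from datetime import datetime

def trip_seconds_in_window(start, end, seconds, win_start, win_end):
    """Portion of a trip's `seconds` falling inside [win_start, win_end]."""
    if end is None:
        # open-ended trip: seconds from trip start to window end, floored at 0
        return max(0.0, (win_end - start).total_seconds())
    # fraction of the trip overlapping the window, scaled by its duration
    overlap = (min(win_end, end) - max(win_start, start)).total_seconds()
    return max(0.0, overlap / (end - start).total_seconds() * seconds)

win = (datetime(2022, 3, 10, 3), datetime(2022, 3, 10, 4))
trips = [
    (datetime(2022, 3, 10, 2, 40), datetime(2022, 3, 10, 3, 20), 2400),
    (datetime(2022, 3, 10, 3, 5),  datetime(2022, 3, 10, 3, 28), 1380),
    (datetime(2022, 3, 10, 3, 12), None, 0),
]
total = sum(trip_seconds_in_window(s, e, sec, *win) for s, e, sec in trips)
print(total)
```

The three trips contribute 1200, 1380, and 2880 seconds respectively to that hour.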
Original author of this content: qqNade
Solution 4
This answer will use staircase, which is built upon pandas and numpy, and operates as part of the pandas ecosystem.
Your data describes intervals, which can be thought of as step functions which have a value of 1 during the interval and 0 otherwise. Using staircase
we will add the step functions for each trip together, slice the step function into hour buckets, and then integrate to get the total time for each bucket.
setup
Dataframe with pandas.Timestamp values. The trip number is not relevant in this solution.
df = pd.DataFrame({
"start_date": [
pd.Timestamp("2022-03-10 1:20"),
pd.Timestamp("2022-03-10 2:18"),
pd.Timestamp("2022-03-10 2:10"),
pd.Timestamp("2022-03-10 2:40"),
pd.Timestamp("2022-03-10 2:45"),
pd.Timestamp("2022-03-10 3:05"),
pd.Timestamp("2022-03-10 3:12"),
pd.Timestamp("2022-03-10 5:30"),
],
"end_date": [
pd.Timestamp("2022-03-10 1:32"),
pd.Timestamp("2022-03-10 2:42"),
pd.Timestamp("2022-03-10 2:23"),
pd.Timestamp("2022-03-10 3:20"),
pd.Timestamp("2022-03-10 2:58"),
pd.Timestamp("2022-03-10 3:28"),
pd.NaT,
pd.Timestamp("2022-03-10 5:48"),
],
})
solution
import staircase as sc
# create step function
# the Stairs class represents a step function. It is to staircase as DataFrame is to pandas.
sf = sc.Stairs(df, start="start_date", end="end_date")
# you could visually inspect it if you want
sf.plot(style="hlines")
From inspection you can see the maximum number of concurrent trips is 3. Also note the step function continues on to infinity with a value of 1 – this is because we do not know the end date for one of the records.
# define hourly buckets as pandas PeriodIndex
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
# integrate the step function over the hourly buckets
total_per_hour = sf.slice(hour_buckets).integral()
total_per_hour is a pandas.Series of pandas.Timedelta values, indexed by a pandas.IntervalIndex. It looks like this:
[2022-03-10 01:00:00, 2022-03-10 02:00:00) 0 days 00:12:00
[2022-03-10 02:00:00, 2022-03-10 03:00:00) 0 days 01:10:00
[2022-03-10 03:00:00, 2022-03-10 04:00:00) 0 days 01:31:00
[2022-03-10 04:00:00, 2022-03-10 05:00:00) 0 days 01:00:00
[2022-03-10 05:00:00, 2022-03-10 06:00:00) 0 days 01:18:00
dtype: timedelta64[ns]
If you want a dataframe format where only the left side of the interval is referenced, and time is given as seconds, then use the following:
pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
to summarise
The solution is
import staircase as sc
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
total_per_hour = sc.Stairs(df, start="start_date", end="end_date").slice(hour_buckets).integral()
# optional
total_per_hour = pd.DataFrame({
"init_date":total_per_hour.index.left,
"seconds":total_per_hour.dt.total_seconds().values,
})
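As a cross-check that does not require staircase, the same per-hour totals can be reproduced with plain pandas by clipping each trip to each hour bucket; the 06:00 cutoff for the open-ended trip is an assumption that matches the values shown above:

```python
import pandas as pd

# the same sample trips as in the setup section (NaT = open-ended trip)
df = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30",
    ]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48",
    ]),
})

# treat the open-ended trip as running until 06:00 (an assumed cutoff)
end_filled = df["end_date"].fillna(pd.Timestamp("2022-03-10 06:00"))

buckets = pd.date_range("2022-03-10 01:00", "2022-03-10 05:00", freq="h")
seconds = [
    # overlap of each trip with [b, b + 1h), negative overlaps floored at 0
    (end_filled.clip(upper=b + pd.Timedelta(hours=1))
     - df["start_date"].clip(lower=b))
    .dt.total_seconds().clip(lower=0).sum()
    for b in buckets
]
print(dict(zip(buckets, seconds)))
```

This yields 720, 4200, 5460, 3600, and 4680 seconds for the hours from 01:00 to 05:00, agreeing with the staircase integral.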
note 1
In your expected answer you do not have values for 2022-03-10 04:00:00. This seems inconsistent with the fact that time for trip 905783 (which has no end date) is included for 2022-03-10 03:00:00 but not for subsequent hours. The solution proposed here includes 3600s for 2022-03-10 04:00:00 and 2022-03-10 05:00:00, which is why it differs from the expected output in the original question.
note 2
If your dataframe has a “driver” column and you want to tally time per driver, then the following will work:
def make_total_by_hour(df_):
return sc.Stairs(df_, "start_date", "end_date").slice(hour_buckets).integral()
total_per_hour = (
df.groupby("driver")
.apply(make_total_by_hour)
.melt(ignore_index=False)
.reset_index()
)
note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
Original author of this content: Riley
Conclusion
That covers four ways to segment and total the time between two dates: pandas, plain SQL, SQLite, and staircase. Hopefully one of them fits your use case. Thank you for reading.