how to segment and get the time between two dates?

We are going to discuss how to segment and get the time between two dates. So let's start this article, which covers solutions in Python (pandas) and SQL.

Solution 1

I have been thinking about getting all the data and solving the problem with pandas.

TLDR: Generate a range of minutes per trip, explode those minutes into rows, and resample those rows into hours to count the minutes per hour:

import pandas as pd
df = pd.read_sql(...)

# convert to datetime dtype if not already
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

# fill missing end dates
current_time = pd.Timestamp('2022-03-10 04:00:00') # or pd.Timestamp.now()
df['end_date'] = df['end_date'].fillna(current_time)

# generate range of minutes per trip
df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)

(df[['id', 'init_date']].explode('init_date')          # explode minutes into rows
   .set_index('init_date')['id'].resample('H').count() # count rows (minutes) per hour
   .mul(60).reset_index(name='seconds'))               # convert minutes to seconds

Output:

init_date            seconds
2022-03-10 01:00:00  720
2022-03-10 02:00:00  4200
2022-03-10 03:00:00  5460
2022-03-10 04:00:00  0
2022-03-10 05:00:00  1080

Step-by-step breakdown

  1. Generate a date_range of minutes from start_date to end_date per trip:

    df['init_date'] = df.apply(lambda x: pd.date_range(x['start_date'], x['end_date'], freq='min', inclusive='left'), axis=1)
    
    # id  number_of_trip  ...  init_date
    # 1   637hui          ...  DatetimeIndex(['2022-03-10 01:20:00', '2022-03-10 01:21:00', ..., '2022-03-10 01:31:00'])
    # 2   384nfj          ...  DatetimeIndex(['2022-03-10 02:18:00', '2022-03-10 02:19:00', ..., '2022-03-10 01:41:00'])
    # 3   102fiu          ...  DatetimeIndex(['2022-03-10 02:10:00', '2022-03-10 02:11:00', ..., '2022-03-10 02:22:00'])
    # 4   948pvc          ...  DatetimeIndex(['2022-03-10 02:40:00', '2022-03-10 02:41:00', ..., '2022-03-10 03:19:00'])
    # 5   473mds          ...  DatetimeIndex(['2022-03-10 02:45:00', '2022-03-10 02:46:00', ..., '2022-03-10 02:57:00'])
    # 6   103fkd          ...  DatetimeIndex(['2022-03-10 03:05:00', '2022-03-10 03:06:00', ..., '2022-03-10 03:27:00'])
    # 7   905783          ...  DatetimeIndex(['2022-03-10 03:12:00', '2022-03-10 03:13:00', ..., '2022-03-10 03:59:00'])
    # 8   498wsq          ...  DatetimeIndex(['2022-03-10 05:30:00', '2022-03-10 05:31:00', ..., '2022-03-10 05:47:00'])
    
  2. explode the minutes into rows:

    exploded = df[['init_date', 'id']].explode('init_date').set_index('init_date')['id']
    
    # init_date
    # 2022-03-10 01:20:00    1
    # 2022-03-10 01:21:00    1
    # 2022-03-10 01:22:00    1
    #                       ..
    # 2022-03-10 05:45:00    8
    # 2022-03-10 05:46:00    8
    # 2022-03-10 05:47:00    8
    # Name: id, Length: 191, dtype: int64
    
  3. resample the rows into hours to count the minutes per hour (× 60 to convert to seconds):

    out = exploded.resample('H').count().mul(60).reset_index(name='seconds')
    
    # init_date            seconds
    # 2022-03-10 01:00:00  720
    # 2022-03-10 02:00:00  4200
    # 2022-03-10 03:00:00  5460
    # 2022-03-10 04:00:00  0
    # 2022-03-10 05:00:00  1080
    

Driver IDs

If I have a column with the driver id, how do I get a segmentation by hours and by driver id without reprocessing?

In this case, just change resample to groupby.resample. Select driver_id before exploding, and group by driver_id before resampling.

As a minimal example, I duplicated the sample data to create two driver_id groups a and b:

# after preprocessing and creating init_date ...

(df[['driver_id', 'init_date']]                        # now include driver_id
    .explode('init_date').set_index('init_date')       # explode minutes into rows
    .groupby('driver_id').resample('H').count()        # count rows (minutes) per hour per driver_id
    .mul(60).rename(columns={'driver_id': 'seconds'})) # convert minutes to seconds

#                                seconds
# driver_id init_date                   
# a         2022-03-10 01:00:00  720
#           2022-03-10 02:00:00  4200
#           2022-03-10 03:00:00  5460
#           2022-03-10 04:00:00  0
#           2022-03-10 05:00:00  1080
# b         2022-03-10 01:00:00  720
#           2022-03-10 02:00:00  4200
#           2022-03-10 03:00:00  5460
#           2022-03-10 04:00:00  0
#           2022-03-10 05:00:00  1080
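Since pd.read_sql(...) above is only a placeholder, the pipeline can be run end to end with the sample trips from the question. This is a self-contained sketch; only the DataFrame construction is new, the rest mirrors the code above:

```python
import pandas as pd

# sample trips from the question; trip 7 has no end_date yet
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30"]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48"]),
})

# fill the missing end date with a fixed "current" time
df["end_date"] = df["end_date"].fillna(pd.Timestamp("2022-03-10 04:00:00"))

# one row per minute per trip, then count minutes per hour
df["init_date"] = df.apply(
    lambda x: pd.date_range(x["start_date"], x["end_date"],
                            freq="min", inclusive="left"),  # pandas >= 1.4
    axis=1)
out = (df[["id", "init_date"]].explode("init_date")
       .set_index("init_date")["id"].resample("h").count()
       .mul(60).reset_index(name="seconds"))
print(out["seconds"].tolist())  # [720, 4200, 5460, 0, 1080]
```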

Original author of this content: tdy

Solution 2

This can be done in plain SQL (apart from the time_bucket function, which is provided by TimescaleDB), using a nested query:

select 
    interval_start, 
    sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
    select 
        interval_start,
        greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
        least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
    from (
        select generate_series(
            (select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
            (select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
            '1 hour') as interval_start) i
    join trips t
        on t.start_date <= i.interval_start + interval '1 hour'
        and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
    ) subq
group by interval_start
order by interval_start;

This gives me the following result:

   interval_start    | seconds
---------------------+---------
 2022-03-10 01:00:00 |     720
 2022-03-10 02:00:00 |    4200
 2022-03-10 03:00:00 |    5460
 2022-03-10 04:00:00 |    3600
 2022-03-10 05:00:00 |    4680
 2022-03-10 06:00:00 |       0
(6 rows)

Explanation

Let’s break the query down.

In the innermost query:

select generate_series(
        (select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
        (select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
        '1 hour'
    ) as interval_start

we generate a series of time interval starts – from the minimal start_date value up to the maximal end_date value, truncated to full hours, with a 1-hour step. Either boundary can obviously be replaced with an arbitrary datetime. The direct result of this query is the following:

   interval_start
---------------------
 2022-03-10 01:00:00
 2022-03-10 02:00:00
 2022-03-10 03:00:00
 2022-03-10 04:00:00
 2022-03-10 05:00:00
 2022-03-10 06:00:00
(6 rows)
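For intuition, the same series of interval starts can be sketched in pandas (a hypothetical equivalent, assuming the same minimal and maximal timestamps as above):

```python
import pandas as pd

# min(start_date) and max(coalesce(end_date, ...)) from the sample data
min_start = pd.Timestamp("2022-03-10 01:20:00")
max_end = pd.Timestamp("2022-03-10 06:00:00")

# truncate to full hours and step by 1 hour, like generate_series + time_bucket
interval_starts = pd.date_range(min_start.floor("h"), max_end.floor("h"), freq="h")
print(interval_starts)  # 2022-03-10 01:00:00 .. 2022-03-10 06:00:00, 6 stamps
```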

Then, the middle-level query joins this series with the trips table, joining rows if and only if any part of the trip took place during the hour-long interval beginning at the time given by the ‘interval_start’ column:

select interval_start,
    greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
    least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
from (
    -- innermost query
    select generate_series(
        (select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
        (select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
        '1 hour'
    ) as interval_start
    -- innermost query end
) intervals
join trips t
    on t.start_date <= intervals.interval_start + interval '1 hour' and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start

The two computed values represent respectively:

  • seconds_before_trip_started – the number of seconds between the beginning of the interval and the beginning of the trip (or 0 if the trip began prior to the interval start). This is time during which the trip didn't take place – thus we will be subtracting it in the following step.
  • seconds_before_trip_ended – the number of seconds between the beginning of the interval and the end of the trip, capped at 3600 (i.e. when the trip didn't end within the concerned interval).

The outermost query subtracts the two aforementioned fields, effectively computing the time each trip took in each interval, and sums it over all trips, grouping by interval:

select 
    interval_start, 
    sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
-- middle-level query
    select 
        interval_start,
        greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
        least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
    from (
        select generate_series(
            (select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
            (select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
            '1 hour') as interval_start) i
    join trips t
        on t.start_date <= i.interval_start + interval '1 hour'
        and coalesce(t.end_date, '2022-03-10 06:00:00') >= interval_start
-- middle-level query end
    ) subq
group by interval_start
order by interval_start;
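The same clamping arithmetic can be cross-checked in plain Python (a hypothetical harness; trips with a missing end_date are filled with 2022-03-10 06:00:00, as in the SQL):

```python
from datetime import datetime, timedelta

FILL = datetime(2022, 3, 10, 6, 0)  # stand-in for coalesce(end_date, ...)
trips = [  # (start_date, end_date) from the sample table; None = still running
    (datetime(2022, 3, 10, 1, 20), datetime(2022, 3, 10, 1, 32)),
    (datetime(2022, 3, 10, 2, 18), datetime(2022, 3, 10, 2, 42)),
    (datetime(2022, 3, 10, 2, 10), datetime(2022, 3, 10, 2, 23)),
    (datetime(2022, 3, 10, 2, 40), datetime(2022, 3, 10, 3, 20)),
    (datetime(2022, 3, 10, 2, 45), datetime(2022, 3, 10, 2, 58)),
    (datetime(2022, 3, 10, 3, 5),  datetime(2022, 3, 10, 3, 28)),
    (datetime(2022, 3, 10, 3, 12), None),
    (datetime(2022, 3, 10, 5, 30), datetime(2022, 3, 10, 5, 48)),
]

result = {}
interval_start = datetime(2022, 3, 10, 1, 0)
while interval_start <= datetime(2022, 3, 10, 6, 0):
    total = 0
    for start, end in trips:
        end = end or FILL
        # join condition: any part of the trip touches the interval
        if start <= interval_start + timedelta(hours=1) and end >= interval_start:
            before_start = max(0, int((start - interval_start).total_seconds()))
            before_end = min(3600, int((end - interval_start).total_seconds()))
            total += before_end - before_start
    result[interval_start] = total
    interval_start += timedelta(hours=1)

print(list(result.values()))  # [720, 4200, 5460, 3600, 4680, 0]
```

This reproduces the query result above, including the 3600-second hour covered entirely by the still-running trip.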

Additional grouping

In case we have another column in the table, and what we really need is the segmentation of the above result in respect to that column, we simply need to add it to the appropriate select and group by clauses (optionally to order by clause as well).

Suppose there’s an additional driver_id column in the trips table:

 id | number_of_trip |     start_date      |      end_date       | seconds | driver_id
----+----------------+---------------------+---------------------+---------+-----------
  1 | 637hui         | 2022-03-10 01:20:00 | 2022-03-10 01:32:00 |     720 |         0
  2 | 384nfj         | 2022-03-10 02:18:00 | 2022-03-10 02:42:00 |    1440 |         0
  3 | 102fiu         | 2022-03-10 02:10:00 | 2022-03-10 02:23:00 |     780 |         1
  4 | 948pvc         | 2022-03-10 02:40:00 | 2022-03-10 03:20:00 |    2400 |         1
  5 | 473mds         | 2022-03-10 02:45:00 | 2022-03-10 02:58:00 |     780 |         1
  6 | 103fkd         | 2022-03-10 03:05:00 | 2022-03-10 03:28:00 |    1380 |         2
  7 | 905783         | 2022-03-10 03:12:00 |                     |       0 |         2
  8 | 498wsq         | 2022-03-10 05:30:00 | 2022-03-10 05:48:00 |    1080 |         2

The modified query would look like this:

select
    interval_start,
    driver_id,
    sum(seconds_before_trip_ended - seconds_before_trip_started) as seconds
from (
    select 
        interval_start,
        driver_id,
        greatest(0, extract(epoch from start_date - interval_start)::int) as seconds_before_trip_started,
        least(3600, extract(epoch from coalesce(end_date, '2022-03-10 06:00:00') - interval_start)::int) as seconds_before_trip_ended
    from (
        select generate_series(
            (select min(time_bucket(bucket_width := INTERVAL '1 hour', ts := start_date, "offset" := '0 minutes')) from trips),
            (select max(time_bucket(bucket_width := INTERVAL '1 hour', ts := coalesce(end_date, '2022-03-10 06:00:00'), "offset" := '0 minutes')) from trips),
            '1 hour') as interval_start
    ) intervals
    join trips t
        on t.start_date <= intervals.interval_start + interval '1 hour'
        and coalesce(t.end_date, '2022-03-10 06:00:00') >= intervals.interval_start
) subq
group by interval_start, driver_id
order by interval_start, driver_id;

and gives the following result:

   interval_start    | driver_id | seconds
---------------------+-----------+---------
 2022-03-10 01:00:00 |         0 |     720
 2022-03-10 02:00:00 |         0 |    1440
 2022-03-10 02:00:00 |         1 |    2760
 2022-03-10 03:00:00 |         1 |    1200
 2022-03-10 03:00:00 |         2 |    4260
 2022-03-10 04:00:00 |         2 |    3600
 2022-03-10 05:00:00 |         2 |    4680
 2022-03-10 06:00:00 |         2 |       0

Original author of this content: tyrrr

Solution 3

Here is what works in sqlite (can be tested):

CREATE TABLE trips(
  id INT PRIMARY KEY NOT NULL,
  start_date TIMESTAMP,
  end_date TIMESTAMP,
  seconds INT
);

INSERT INTO trips(id, start_date, end_date, seconds) VALUES
  (1, '2022-03-10 01:20:00', '2022-03-10 01:32:00', 720),
  (2, '2022-03-10 02:18:00', '2022-03-10 02:42:00', 1440),
  (3, '2022-03-10 02:10:00', '2022-03-10 02:23:00', 780),
  (4, '2022-03-10 02:40:00', '2022-03-10 03:20:00', 2400),
  (5, '2022-03-10 02:45:00', '2022-03-10 02:58:00', 780),
  (6, '2022-03-10 03:05:00', '2022-03-10 03:28:00', 1380),
  (7, '2022-03-10 03:12:00', NULL, 0),
  (8, '2022-03-10 05:30:00', '2022-03-10 05:48:00', 1080);

WITH 
  checked AS (SELECT '2022-03-10 03:00:00' AS start, '2022-03-10 04:00:00' AS end)
SELECT
  SUM(
    IIF(end_date IS NULL, ROUND(MAX(0, (JULIANDAY(checked.end) - JULIANDAY(start_date)) * 24 * 60 * 60)),
      MAX(
        0,
        (JULIANDAY(MIN(checked.end, end_date)) - JULIANDAY(MAX(checked.start, start_date))) / 
          (JULIANDAY(end_date) - JULIANDAY(start_date)) * seconds
      )
    )
  )
FROM trips, checked;

DROP TABLE trips;

The code is simplified and SQLite lacks some features, but I think it will be easy to adapt 🙂

Briefly, the algorithm is:

  • If end_date IS NULL, then:
    • Calculate the number of seconds from the start of the trip to the end of the interval
    • Throw away negative values
  • Otherwise:
    • Calculate what part of the trip, in seconds, falls within the interval
    • Throw away negative values
  • Sum the values

This can be done for any interval with a start and an end.
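The same algorithm can be sketched in plain Python (a hypothetical cross-check, using the sample rows from the INSERT above and the 03:00–04:00 interval):

```python
from datetime import datetime

checked_start = datetime(2022, 3, 10, 3, 0)
checked_end = datetime(2022, 3, 10, 4, 0)

trips = [  # (start_date, end_date, seconds) from the INSERT above
    (datetime(2022, 3, 10, 1, 20), datetime(2022, 3, 10, 1, 32), 720),
    (datetime(2022, 3, 10, 2, 18), datetime(2022, 3, 10, 2, 42), 1440),
    (datetime(2022, 3, 10, 2, 10), datetime(2022, 3, 10, 2, 23), 780),
    (datetime(2022, 3, 10, 2, 40), datetime(2022, 3, 10, 3, 20), 2400),
    (datetime(2022, 3, 10, 2, 45), datetime(2022, 3, 10, 2, 58), 780),
    (datetime(2022, 3, 10, 3, 5),  datetime(2022, 3, 10, 3, 28), 1380),
    (datetime(2022, 3, 10, 3, 12), None, 0),
    (datetime(2022, 3, 10, 5, 30), datetime(2022, 3, 10, 5, 48), 1080),
]

total = 0
for start, end, seconds in trips:
    if end is None:
        # open-ended trip: seconds from trip start to interval end, floored at 0
        total += max(0, (checked_end - start).total_seconds())
    else:
        # share of the trip's seconds that falls inside the interval, floored at 0
        overlap = (min(checked_end, end) - max(checked_start, start)).total_seconds()
        duration = (end - start).total_seconds()
        total += max(0, overlap / duration * seconds)

print(int(total))  # 5460
```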

Original author of this content: qqNade

Solution 4

This answer will use staircase, which is built upon pandas and numpy, and operates as part of the pandas ecosystem.

Your data describes intervals, which can be thought of as step functions which have a value of 1 during the interval and 0 otherwise. Using staircase we will add the step functions for each trip together, slice the step function into hour buckets, and then integrate to get the total time for each bucket.

setup

A DataFrame with pandas.Timestamp values. The trip number is not relevant in this solution.

df = pd.DataFrame({
    "start_date": [
        pd.Timestamp("2022-03-10 1:20"),
        pd.Timestamp("2022-03-10 2:18"),
        pd.Timestamp("2022-03-10 2:10"),
        pd.Timestamp("2022-03-10 2:40"),
        pd.Timestamp("2022-03-10 2:45"),
        pd.Timestamp("2022-03-10 3:05"),
        pd.Timestamp("2022-03-10 3:12"),
        pd.Timestamp("2022-03-10 5:30"),
    ],
    "end_date": [
        pd.Timestamp("2022-03-10 1:32"),
        pd.Timestamp("2022-03-10 2:42"),
        pd.Timestamp("2022-03-10 2:23"),
        pd.Timestamp("2022-03-10 3:20"),
        pd.Timestamp("2022-03-10 2:58"),
        pd.Timestamp("2022-03-10 3:28"),
        pd.NaT,
        pd.Timestamp("2022-03-10 5:48"),
    ],
})

solution

import staircase as sc

# create step function
# the Stairs class represents a step function.  It is to staircase as DataFrame is to pandas.
sf = sc.Stairs(df, start="start_date", end="end_date")

# you could visually inspect it if you want
sf.plot(style="hlines")

[figure: step function plot]

From inspection you can see the maximum number of concurrent trips is 3. Also note the step function continues on to infinity with a value of 1 – this is because we do not know the end date for one of the records.

# define hourly buckets as pandas PeriodIndex
hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")

# integrate the step function over the hourly buckets
total_per_hour = sf.slice(hour_buckets).integral()

total_per_hour is a pandas.Series of pandas.Timedelta values and indexed by a pandas.IntervalIndex. It looks like this

[2022-03-10 01:00:00, 2022-03-10 02:00:00)   0 days 00:12:00
[2022-03-10 02:00:00, 2022-03-10 03:00:00)   0 days 01:10:00
[2022-03-10 03:00:00, 2022-03-10 04:00:00)   0 days 01:31:00
[2022-03-10 04:00:00, 2022-03-10 05:00:00)   0 days 01:00:00
[2022-03-10 05:00:00, 2022-03-10 06:00:00)   0 days 01:18:00
dtype: timedelta64[ns]
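If staircase is not available, the same integrals can be cross-checked with plain interval-overlap arithmetic in pandas (a hypothetical cross-check; the open-ended trip is treated as extending past every bucket, matching the step function above):

```python
import pandas as pd

# df as constructed in the setup above
df = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2022-03-10 01:20", "2022-03-10 02:18", "2022-03-10 02:10",
        "2022-03-10 02:40", "2022-03-10 02:45", "2022-03-10 03:05",
        "2022-03-10 03:12", "2022-03-10 05:30"]),
    "end_date": pd.to_datetime([
        "2022-03-10 01:32", "2022-03-10 02:42", "2022-03-10 02:23",
        "2022-03-10 03:20", "2022-03-10 02:58", "2022-03-10 03:28",
        None, "2022-03-10 05:48"]),
})

end = df["end_date"].fillna(pd.Timestamp.max)  # NaT -> "never ends"
seconds = {}
for left in pd.date_range("2022-03-10 01:00", "2022-03-10 05:00", freq="h"):
    right = left + pd.Timedelta(hours=1)
    # clip each trip to the bucket and sum the positive overlaps
    overlap = end.clip(upper=right) - df["start_date"].clip(lower=left)
    seconds[left] = overlap[overlap > pd.Timedelta(0)].sum().total_seconds()

print(list(seconds.values()))  # [720.0, 4200.0, 5460.0, 3600.0, 4680.0]
```

The totals agree with the integrals of the step function shown above (12 min, 1 h 10 min, 1 h 31 min, 1 h, 1 h 18 min).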

If you want a dataframe format where only the left side of the interval is referenced, and time is given as seconds, then use the following

pd.DataFrame({
    "init_date":total_per_hour.index.left,
    "seconds":total_per_hour.dt.total_seconds().values,
})

to summarise

The solution is

import staircase as sc

hour_buckets = pd.period_range("2022-03-10 1:00", "2022-03-10 5:00", freq="H")
total_per_hour = sc.Stairs(df, start="start_date", end="end_date").slice(hour_buckets).integral()

# optional
total_per_hour = pd.DataFrame({
    "init_date":total_per_hour.index.left,
    "seconds":total_per_hour.dt.total_seconds().values,
})

note 1

In your expected answer you do not have values for 2022-03-10 04:00:00.

This seems inconsistent with the fact that time for trip 905783 (with no end date) is being included for 2022-03-10 03:00:00 but not subsequent hours.

The solution proposed here includes 3600s for 2022-03-10 04:00:00 and 2022-03-10 05:00:00 which is why it differs from the expected solution in the original question.

note 2

If your dataframe has a “driver” column and you want to tally time per driver then the following will work

def make_total_by_hour(df_):
    return sc.Stairs(df_, "start_date", "end_date").slice(hour_buckets).integral()

total_per_hour = (
    df.groupby("driver")
    .apply(make_total_by_hour)
    .melt(ignore_index=False)
    .reset_index()
)

note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.

Original author of this content: Riley

Conclusion

That is all for this tutorial, covering four approaches: pandas, plain SQL, SQLite, and staircase. Hope it helped you. Thank you.

ittutorial team
