===== Get the first non null value in each column ===== ==== Task ==== Get the first non null value in each column. Corner cases: * If a column is all NaNs, return a NaN. For example, given jim joe jolie jack 0 1.0 NaN NaN 0 NaN 2.0 NaN 1 3.0 NaN NaN 1 NaN 4.0 NaN We want jim joe jolie jack 0 1.0 2.0 NaN 1 3.0 4.0 NaN ==== Solution ==== $ ipython In [1]: import pandas as pd import numpy as np df = pd.DataFrame({'jim': [0, 0, 1, 1], 'joe': [1, np.nan, 3, np.nan], 'jolie': [np.nan, 2, np.nan, 4], 'jack': [np.nan, np.nan, np.nan, np.nan]}) df Out[1]: jim joe jolie jack 0 0 1.0 NaN NaN 1 0 NaN 2.0 NaN 2 1 3.0 NaN NaN 3 1 NaN 4.0 NaN In [2]: def get_first_non_nan(s): values = s.loc[~s.isnull()] value = values.iloc[0] if not values.empty else np.nan return value In [3]: df.groupby('jim').agg(get_first_non_nan) Out[3]: joe jolie jack jim 0 1.0 2.0 NaN 1 3.0 4.0 NaN In [4]: df.groupby('jim').agg(get_first_non_nan).reset_index() Out[4]: jim joe jolie jack 0 0 1.0 2.0 NaN 1 1 3.0 4.0 NaN ==== meta ==== Used | Python 3.9.4 and IPython 7.22.0 demonstrates | apply a function on each column of a dataframe after doing a groupby ===== Get the first value not equal to a number ===== ==== Use case ==== One downside of using np.nan to denote a missing value is that an integer column of a dataframe gets "promoted" to a floating point column even if there is a single np.nan in it. One work around is to use a specific integer to denote a missing value. For example, if we expect all integers to be positive, we can use -9999 to denote a missing value. Let's call this special integer NAN_INT. ==== Task ==== Get the first value not equal to NAN_INT Corner cases: * If a column is all NAN_INTs, return a NAN_INT For example, give jim joe jolie jack 0 1 -9999 -9999 0 -9999 2 -9999 1 3 -9999 -9999 1 -9999 4 -9999 We want jim joe jolie jack 0 1 2 -9999 1 3 4 -9999 ==== Solution ==== $ ipython In [1]: import pandas as pd import numpy as np NAN_INT = -9999 df = pd.DataFrame({'jim': [0, 0, 1, 1], 'joe': [1, NAN_INT, 3, NAN_INT], 'jolie': [NAN_INT, 2, NAN_INT, 4], 'jack': [NAN_INT, NAN_INT, NAN_INT, NAN_INT]}) df Out[1]: jim joe jolie jack 0 0 1 -9999 -9999 1 0 -9999 2 -9999 2 1 3 -9999 -9999 3 1 -9999 4 -9999 In [2]: df.dtypes Out[2]: jim int64 joe int64 jolie int64 jack int64 dtype: object In [3]: def get_first_non_missing(s): NAN_INT = -9999 values = s.loc[(s != NAN_INT)] value = values.iloc[0] if not values.empty else NAN_INT return value In [4]: df.groupby('jim').agg(get_first_non_missing) Out[4]: joe jolie jack jim 0 1 2 -9999 1 3 4 -9999 In [5]: df.groupby('jim').agg(get_first_non_missing).reset_index() Out[5]: jim joe jolie jack 0 0 1 2 -9999 1 1 3 4 -9999 ==== meta ==== Used | Python 3.9.4, IPython 7.22.0