pandas_dataframe

Creating a dataframe

Create a dataframe from list of lists

>>> import pandas as pd
>>> data = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
>>> df = pd.DataFrame(data)
>>> df
   0  1  2  3
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1

You can also specify a list of columns for the dataframe.

>>> columns = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame(data, columns=columns)
>>> df
   a  b  c  d
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1

To specify the index. (Note: the second list below has only five elements, so pandas fills the missing entry with NaN.)

>>> df = pd.DataFrame([[1, 2, 3, 4, 15, 6], ['1', '2', '3', '4', 'F']], index=['CT', 'NY'])
>>> df
    0  1  2  3   4    5
CT  1  2  3  4  15  6.0
NY  1  2  3  4   F  NaN

To make each list into a column, use zip

% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
import pandas as pd
# create lists
l1 = ["Amar", "Barsha", "Carlos", "Tanmay", "Misbah"]
l2 = ["Alpha", "Bravo", "Charlie", "Tango", "Mike"]
l3 = [23, 25, 22, 27, 29]
l4 = [69, 54, 73, 70, 74]
# create the dataframe
team = pd.DataFrame(list(zip(l1, l2, l3, l4)), columns=['Name', 'Code', 'Age', 'Weight'])
print(team)
     Name     Code  Age  Weight
0    Amar    Alpha   23      69
1  Barsha    Bravo   25      54
2  Carlos  Charlie   22      73
3  Tanmay    Tango   27      70
4  Misbah     Mike   29      74

tags | row by row

Create a data frame from two lists

On the interpreter

>>> import pandas as pd
>>> df = pd.DataFrame({'letters': ['a', 'b', 'c', 'd'], 'numbers': [1,2,3,4]})
>>> df
  letters  numbers
0       a        1
1       b        2
2       c        3
3       d        4

In a script

% cat create_df_from_lists.py
import pandas as pd
letters = ['a', 'b', 'c']
words = ['apple', 'ball', 'cat']
df = pd.DataFrame(
    {'letter': letters,
     'word': words})

print("letters:\n", letters, "\n")
print("words:\n", words, "\n")
print("df:\n", df, "\n")

Sample run:

 % python3 -u create_df_from_lists.py
letters:
 ['a', 'b', 'c'] 

words:
 ['apple', 'ball', 'cat'] 

df:
   letter   word
0      a  apple
1      b   ball
2      c    cat

Create a dataframe from a list of tuples

In [1]:
import pandas as pd
data = [(0, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
df = pd.DataFrame(data)
df
Out[1]:
   0  1  2  3
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1

You can also specify the columns for the dataframe.

In [2]:
columns=('a', 'b', 'c', 'd')
df = pd.DataFrame(data, columns=columns)
df
Out[2]:
   a  b  c  d
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1

Create a dataframe using StringIO

In [1]: 
import pandas as pd
import io
data = """
Date,Open,High,Low,Close,Adj Close,Volume
2024-01-02,472.16,473.67,470.49,472.65,469.67,123623700
2024-01-03,470.43,471.19,468.17,468.79,465.84,103585900
2024-01-04,468.3,470.96,467.05,467.28,464.33,84232200
2024-01-05,467.49,470.44,466.43,467.92,464.97,86060800
2024-01-08,468.43,474.75,468.3,474.6,471.61,74879100
"""
df = pd.read_csv(io.StringIO(data))

In [2]: 
df
Out[2]: 
         Date    Open    High     Low   Close  Adj Close     Volume
0  2024-01-02  472.16  473.67  470.49  472.65     469.67  123623700
1  2024-01-03  470.43  471.19  468.17  468.79     465.84  103585900
2  2024-01-04  468.30  470.96  467.05  467.28     464.33   84232200
3  2024-01-05  467.49  470.44  466.43  467.92     464.97   86060800
4  2024-01-08  468.43  474.75  468.30  474.60     471.61   74879100

Tested with Python 3.12.3, IPython 8.25.0

Create a dataframe by splitting strings

Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.

Load the data into a pandas series

% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
import pandas as pd
strings = ['Netflix 100', 'Costco 200', 'Walmart 500', 'Costco 500']
s = pd.Series(strings)
print(s)
0    Netflix 100
1     Costco 200
2    Walmart 500
3     Costco 500
dtype: object

Create a dataframe by splitting the strings into multiple columns

In [2]: 
df = s.str.split(expand=True)
print(df)
         0    1
0  Netflix  100
1   Costco  200
2  Walmart  500
3   Costco  500

Name the columns. str.split gives everything as strings, so convert to numbers as needed.

In [3]: 
df.columns = ['company', 'value']
df['value'] = df['value'].astype('float')
print(df)
   company  value
0  Netflix  100.0
1   Costco  200.0
2  Walmart  500.0
3   Costco  500.0

See also:

also demonstrates | assign column names to a dataframe

Create a dataframe from a series of lists

tags | convert series with lists to dataframe

df = pd.DataFrame(s.to_list())

For example

In [1]:
import pandas as pd
s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
s
Out[1]:
0       [1, 2, 3]
1    [4, 5, 6, 7]
2          [8, 9]
dtype: object

In [2]:
df = pd.DataFrame(s.to_list())
df
Out[2]:
   0  1    2    3
0  1  2  3.0  NaN
1  4  5  6.0  7.0
2  8  9  NaN  NaN

If every list has the same number of elements, np.vstack() can also be used; otherwise it raises a ValueError. For example

In [5]:
s
Out[5]:
0       [1, 2, 3]
1    [4, 5, 6, 7]
2          [8, 9]
dtype: object

In [6]:
import numpy as np
df = pd.DataFrame(np.vstack(s))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 2
      1 import numpy as np
----> 2 df = pd.DataFrame(np.vstack(s))

File <__array_function__ internals>:200, in vstack(*args, **kwargs)

File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
    294 if not isinstance(arrs, list):
    295     arrs = [arrs]
--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)

File <__array_function__ internals>:200, in concatenate(*args, **kwargs)

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4

But with equal-length lists it works:

In [10]:
s = pd.Series([[1, 2, 3], [4, 5, 6]])
s
Out[10]:
0    [1, 2, 3]
1    [4, 5, 6]
dtype: object

In [11]:
import numpy as np
df = pd.DataFrame(np.vstack(s))
df
Out[11]:
   0  1  2
0  1  2  3
1  4  5  6

In [12]:
df = pd.DataFrame(s.to_list())
df
Out[12]:
   0  1  2
0  1  2  3
1  4  5  6

Create a dataframe from a bunch of variables

import pandas as pd
df = pd.DataFrame({
  'key': ['var1', 'var2', 'var3'],
  'value': [var1, var2, var3]
})

For example

$ ipython

In [1]:
year = 2023; month = 6; date = 15

In [2]:
import pandas as pd
df = pd.DataFrame({
  'key': ['year', 'month', 'date'],
  'value': [year, month, date]
})

In [3]:
df
Out[3]:
     key  value
0   year   2023
1  month      6
2   date     15

In [4]:
df.dtypes
Out[4]:
key      object
value     int64
dtype: object

It works even if the variables are not of the same type.

In [5]:
year = 2023; month = 'June'; date = 15

In [6]:
df = pd.DataFrame({
  'key': ['year', 'month', 'date'],
  'value': [year, month, date]
})

In [7]:
df
Out[7]:
     key value
0   year  2023
1  month  June
2   date    15

In [8]:
df.dtypes
Out[8]:
key      object
value    object
dtype: object

Tested with Python 3.11.3, IPython 8.12.0

split columns

See "Create a dataframe by splitting strings" above.

lookup value

To pick the first value in column 'A' for rows where column B is FOO

df.loc[df['B'] == 'FOO', 'A'].iloc[0]

Example:

$ ipython
In [1]:
import pandas as pd
df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
print(df)
    A  B
0  p1  1
1  p2  3
2  p3  3
3  p4  2

In [2]:
df.loc[df['B'] == 3, 'A']
Out[2]:
1    p2
2    p3
Name: A, dtype: object

In [3]:
df.loc[df['B'] == 3, 'A'].iloc[0]
Out[3]:
'p2'

search tags | value of one column when another column equals something

Ref:- https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-in-pandas

view the index, type, length and name of the series

s.index
s.dtype
s.size
s.name
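
For example (a minimal sketch):

In [1]:
import pandas as pd
s = pd.Series([1, 2, 3], name='counts')
s.index, s.dtype, s.size, s.name
Out[1]:
(RangeIndex(start=0, stop=3, step=1), dtype('int64'), 3, 'counts')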

unique types found in an object series

First, use the .apply method with the type function to get back a Series that has the type of every member. Then, chain the .unique method onto the result.

s.apply(type).unique()
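
For example (a minimal sketch with a mixed-type series):

In [1]:
import pandas as pd
s = pd.Series(['foo', 42, 3.14])
s.apply(type).unique()
Out[1]:
array([<class 'str'>, <class 'int'>, <class 'float'>], dtype=object)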

a vs b

shape vs size

df.shape gives the number of elements in each dimension, as a (rows, columns) tuple. df.size gives the total number of elements.

In [1]:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                   'col3': [5, 6]})

In [2]:
df
Out[2]:
   col1  col2  col3
0     1     3     5
1     2     4     6

In [3]:
df.shape
Out[3]:
(2, 3)

In [4]:
df.size
Out[4]:
6

using | pandas 1.1.2, python 3.8.18

Missing data

stringify nans

>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a
0    foo
1    NaN
2    bar
3    NaN
dtype: object
>>> a.fillna('baz')
0    foo
1    baz
2    bar
3    baz
dtype: object
>>> a
0    foo
1    NaN
2    bar
3    NaN
dtype: object

replace a string with NaN

% python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a.isnull()
0    False
1     True
2    False
3     True
dtype: bool
>>> b = a.fillna('baz')
>>> b
0    foo
1    baz
2    bar
3    baz
dtype: object
>>> c = b.replace('baz', np.nan, regex=True)
>>> c
0    foo
1    NaN
2    bar
3    NaN
dtype: object
>>> c.isnull()
0    False
1     True
2    False
3     True
dtype: bool

demonstrates | how to change a string to NaN

See also:- https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas

use values in another column as a fallback

$ ipython
In [1]:
import pandas as pd
df = pd.DataFrame({'old' :['m', 'n', 'o', 'p'], 'new': ['a', None, 'b', None]})
df
Out[1]:
  old   new
0   m     a
1   n  None
2   o     b
3   p  None

In [2]:
df['new'] = df['new'].fillna(df['old'])
df
Out[2]:
  old new
0   m   a
1   n   n
2   o   b
3   p   p

using | Python 3.11.8, pandas 2.1.4

Tasks

read_csv()

read all columns as strings

# read_csv preserves the white space. So strip it out.
df = pd.read_csv(file_name, dtype='str')\
    .rename(columns=lambda x: x.strip())\
    .apply(lambda x: x.str.strip())

[{demonstrates: strip spaces in the column names, strip all elements},
{tags: read_csv as all strings}]

check if

check if some columns exist in a dataframe

approach 1:

if set(['A', 'C']).issubset(df.columns):
    df['sum'] = df['A'] + df['C']

approach 2:

Instead of set([...]), you can also use a set literal.

if {'A', 'C'}.issubset(df.columns):
  ...

approach 3:

You can also do this using a generator comprehension

if all(item in df.columns for item in ['A', 'C']):
  ...
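
For example, using approach 1 (a minimal sketch):

In [1]:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
if set(['A', 'C']).issubset(df.columns):
    df['sum'] = df['A'] + df['C']
df
Out[1]:
   A  B  C  sum
0  1  3  5    6
1  2  4  6    8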

matches to coplays

Consider the dataframe

   match_id player_id
0         0         a
1         0         b
2         0         c
3         1         a
4         1         b
5         2         c

which shows which players played in each match. For example, [a, b, c] played in match 0, [a, b] played in match 1, and c alone played in match 2.

We want to get a coplay count

  player_id1 player_id2  size
0          a          a     2
1          a          b     2
2          a          c     1
3          b          a     2
4          b          b     2
5          b          c     1
6          c          a     1
7          c          b     1
8          c          c     2

which shows

  • the number of games each player played with another player when player_id1 != player_id2 (e.g., a and b played in 2 games, a and c played in 1 game)
  • the number of games a player played when player_id1 == player_id2

Solution
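
One way to get this (a sketch: self-join the table on match_id, then count each (player_id1, player_id2) pair; the data below reproduces the table above):

In [1]:
import pandas as pd
df = pd.DataFrame({'match_id': [0, 0, 0, 1, 1, 2],
                   'player_id': ['a', 'b', 'c', 'a', 'b', 'c']})
pairs = df.merge(df, on='match_id', suffixes=('1', '2'))
coplay = pairs.groupby(['player_id1', 'player_id2']).size().reset_index(name='size')
coplay
Out[1]:
  player_id1 player_id2  size
0          a          a     2
1          a          b     2
2          a          c     1
3          b          a     2
4          b          b     2
5          b          c     1
6          c          a     1
7          c          b     1
8          c          c     2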

convert stuff

convert column names to lower case and replace space with underscore

df.columns = df.columns.str.lower().str.replace(' ', '_')

Example:

$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
import pandas as pd
df = pd.DataFrame(
    [['2023-03-02', '2023-02-28', 3.20, 3.30],
     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [3]:
df
Out[3]:
  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

convert column names to lower case

Approach 1:

df.columns = df.columns.str.lower()

Approach 2:

df.rename(columns=lambda x: x.lower(), inplace=True)

Notes:

  • I prefer approach 1: simple syntax, easy to remember.

Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with the same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload the data into a Hadoop cluster since Hive is not case-sensitive. As a workaround, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, …, suffixes=('_df1', '_df2')). The resulting data frame will then have foo_df1 and foo_df2 columns.

Example (using approach 1):

$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
import pandas as pd
df = pd.DataFrame(
    [['2023-03-02', '2023-02-28', 3.20, 3.30],
     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]:
df.columns = df.columns.str.lower()

In [3]:
df
Out[3]:
  announcement date fiscal quarter end  estimated eps  actual eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

Example (using approach 2):

$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
import pandas as pd
df = pd.DataFrame(
    [['2023-03-02', '2023-02-28', 3.20, 3.30],
     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
    columns = ["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]:
df.rename(columns=lambda x: x.lower(), inplace=True)

In [3]:
df
Out[3]:
  announcement date fiscal quarter end  estimated eps  actual eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

convert all categorical columns to string columns

df.loc[:, df.dtypes == "category"] = df.select_dtypes(
    ["category"]
    ).apply(lambda x: x.astype("object"))

Ref:- https://stackoverflow.com/a/56944992

convert all columns in a dataframe to string

df.astype(str)

Example:

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': [648, 435], 'b': [175, 389]})
   ...: print(df)
     a    b
0  648  175
1  435  389

In [2]: df.dtypes
Out[2]:
a    int64
b    int64
dtype: object

In [3]: df2 = df.astype(str)
   ...: df2.dtypes
Out[3]:
a    object
b    object
dtype: object

In [4]: print(df2)
     a    b
0  648  175
1  435  389

convert two columns of dataframe into a dictionary

df.set_index('id')['value'].to_dict()

Example

In [1]:
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'value': [1.7, 5.2, 4.4]})
df
Out[1]:
   id  value
0   0    1.7
1   1    5.2
2   2    4.4

In [2]:
df.set_index('id')['value'].to_dict()
Out[2]:
{0: 1.7, 1: 5.2, 2: 4.4}

Ref:- https://stackoverflow.com/questions/18695605/python-pandas-dataframe-to-dictionary

convert % to numbers

tags | convert percent symbol in strings to numbers

In [2]:
import pandas as pd
df = pd.DataFrame({'foo': ['1.8%', '-2.5%', '0.7%', '3.2%']})
df
Out[2]:
     foo
0   1.8%
1  -2.5%
2   0.7%
3   3.2%

In [3]:
df['foo'] = df['foo'].str.rstrip('%').astype(float)/100

In [4]:
df
Out[4]:
     foo
0  0.018
1 -0.025
2  0.007
3  0.032

Tested using python 3.10.9, ipython 8.8.0, and pandas 1.5.2

Ref:- https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv

json to dataframe

From           To             Use
JSON           DataFrame      pandas.read_json()
DataFrame      JSON           df.to_json() or df.to_json(orient='records')
JSON           Python object  json.loads()
Python object  JSON           json.dumps()

Sample code nbviewer/github/KamarajuKusumanchi

sum if columns exist

tags | sum if column exists

$ ipython
In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID1':[15.3, 12.1, 13.2, 10.0, np.nan],
                   'ID2':[7.0, 7.7, np.nan, 11.3, np.nan],
                   'ID5':[10, 15, 3.1, 2.2, np.nan]})
df
Out[1]:
    ID1   ID2   ID5
0  15.3   7.0  10.0
1  12.1   7.7  15.0
2  13.2   NaN   3.1
3  10.0  11.3   2.2
4   NaN   NaN   NaN

In [2]:
List_ID = ['ID1','ID2','ID3']
cols = df.columns[df.columns.isin(List_ID)]
cols
Out[2]:
Index(['ID1', 'ID2'], dtype='object')

In [3]:
res = df[cols].sum(axis=1)
res
Out[3]:
0    22.3
1    19.8
2    13.2
3    21.3
4     0.0
dtype: float64
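
Note: the all-NaN row sums to 0.0 because skipping the NaNs leaves an empty sum, which sum() treats as 0 by default. Pass min_count=1, i.e. df[cols].sum(axis=1, min_count=1), to get NaN there instead.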

Using Python 3.9.4, IPython 7.22.0, pandas 1.2.4, numpy 1.20.1

Ref:- https://stackoverflow.com/questions/38700848/adding-columns-if-they-exist-in-the-dataframe-pandas

get unique values in a series as another series

  • Series.unique() returns a numpy.ndarray
  • Series.drop_duplicates() returns a Series
    • Series.drop_duplicates(keep='first') retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.
In [1]:
import pandas as pd
df = pd.DataFrame({
    'a': [1,2,3,3],
    'b': ['foo', 'foo', 'bar', 'bar']})
df
Out[1]:
   a    b
0  1  foo
1  2  foo
2  3  bar
3  3  bar

In [2]:
df['a'].unique()
Out[2]:
array([1, 2, 3], dtype=int64)

In [3]:
type(df['a'].unique())
Out[3]:
numpy.ndarray

In [4]:
df['a'].drop_duplicates()
Out[4]:
0    1
1    2
2    3
Name: a, dtype: int64

In [5]:
type(df['a'].drop_duplicates())
Out[5]:
pandas.core.series.Series

In [6]:
df['a'].drop_duplicates(keep='last')
Out[6]:
0    1
1    2
3    3
Name: a, dtype: int64

In [7]:
type(df['a'].drop_duplicates(keep='last'))
Out[7]:
pandas.core.series.Series

Ref:- https://github.com/pandas-dev/pandas/issues/1923#issuecomment-398217427

clean up column names

# Strip out the white space from both ends of the column names
df.columns = df.columns.str.strip()
# Remove white space from both ends of the column names
# -> convert them to lower case
# -> replace space with an underscore
# -> remove open and close parenthesis
# (regex=False treats '(' and ')' as literal characters, not regex syntax)
df.columns = (df.columns.str.strip().str.lower().str.replace(' ', '_')
              .str.replace('(', '', regex=False).str.replace(')', '', regex=False))
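
For example (a minimal sketch):

In [1]:
import pandas as pd
df = pd.DataFrame(columns=['  Foo Bar (pct) '])
df.columns = (df.columns.str.strip().str.lower().str.replace(' ', '_')
              .str.replace('(', '', regex=False).str.replace(')', '', regex=False))
df.columns
Out[1]:
Index(['foo_bar_pct'], dtype='object')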

round a single column

How to round a single column in pandas without affecting other columns? For example, given

df:

  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0

How to get

  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0

Solution:

df['value1'] = df['value1'].round()

Details:

In [1]: 
import pandas as pd
df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
                   'value1': [1.12, 1.50, 0.10, 3.30, 4.80],
                   'value2': [1.3, 2.5, 0.0, -1.0, -1.0]})
df
Out[1]: 
  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0

In [2]: 
df['value1'] = df['value1'].round()
df
Out[2]: 
  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0

relative percentage difference

Code:

% cat rel_pct_diff.py
import pandas as pd
import numpy as np
print('Using pandas', pd.__version__, ', numpy', np.__version__)
df = pd.DataFrame({'old': [2, 1, 0, 5, 0], 'new': [2.1, 1.1, 0.1, 4.9, 0]})
print('original df')
print(df)
df['rel_pct_diff1'] = ((df['new'] / df['old']) - 1) * 100
df['rel_pct_diff2'] = ((df['new'].divide(
    df['old'].where(df['old'] != 0, np.nan))) - 1) * 100
print(df)

Run:

% python ./rel_pct_diff.py
Using pandas 1.1.3 , numpy 1.19.1
original df
   old  new
0    2  2.1
1    1  1.1
2    0  0.1
3    5  4.9
4    0  0.0
   old  new  rel_pct_diff1  rel_pct_diff2
0    2  2.1            5.0            5.0
1    1  1.1           10.0           10.0
2    0  0.1            inf            NaN
3    5  4.9           -2.0           -2.0
4    0  0.0            NaN            NaN

how to do XYZ

start column numbers from 1

By default, column numbers start from 0.

% ipython
Python 3.8.3 (default, Jul  2 2020, 16:21:59)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: n = np.ones(10).reshape(2,5)
   ...: print(n)
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

In [2]: df = pd.DataFrame(n)
   ...: print(df)
     0    1    2    3    4
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0

To change them to start from 1

In [3]: df.columns = range(1, df.shape[1]+1)
   ...: print(df)
     1    2    3    4    5
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0

where df.shape[1] gives the number of columns in a dataframe.

In [4]: print(df.shape[1])
5

read everything as string

See "read all columns as strings" under read_csv() above.

dollars two decimal places

Use float_format="%.2f"

df.to_csv(file_path, float_format="%.2f")

number of days between two YYYYMMDD string columns

tags | number of days between two dates, convert YYYYMMDD to date

a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
df['days'] = (a_dt - b_dt).dt.days
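
For example (a minimal sketch):

In [1]:
import pandas as pd
df = pd.DataFrame({'a': ['20240105', '20240110'], 'b': ['20240101', '20240102']})
a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
df['days'] = (a_dt - b_dt).dt.days
df
Out[1]:
          a         b  days
0  20240105  20240101     4
1  20240110  20240102     8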

Select columns from a dataframe only if they exist

df[df.columns.intersection(set(['list', 'of', 'cols']))]

Example:

$ ipython
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
import pandas as pd
df = pd.DataFrame(columns=[1,2,3,4])
df
Out[1]:
Empty DataFrame
Columns: [1, 2, 3, 4]
Index: []

In [2]:
df[df.columns.intersection(set([1, 2, 2, 5]))]
Out[2]:
Empty DataFrame
Columns: [1, 2]
Index: []

In [3]:
pd.__version__
Out[3]:
'1.2.1'

See also:- https://stackoverflow.com/questions/43537166/select-columns-from-dataframe-on-condition-they-exist

migrate sql case when else statements to build a new column
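
A minimal sketch of one common approach, using np.select (the column names, conditions, and labels below are made up for illustration):

In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'score': [95, 67, 42, 78]})
# SQL equivalent:
#   CASE WHEN score >= 90 THEN 'high'
#        WHEN score >= 60 THEN 'mid'
#        ELSE 'low' END
conditions = [df['score'] >= 90, df['score'] >= 60]
choices = ['high', 'mid']
df['grade'] = np.select(conditions, choices, default='low')
df
Out[1]:
   score grade
0     95  high
1     67   mid
2     42   low
3     78   mid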

initialize column with strings that range from 1 to N

To set it to '1', '2', …, 'N'

df['foo'] = [str(x) for x in range(1, 1 + df.shape[0])]

To set it to 'A_1', 'A_2', …, 'A_N'

df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]
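
For example (a minimal sketch):

In [1]:
import pandas as pd
df = pd.DataFrame({'bar': [10, 20, 30]})
df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]
df
Out[1]:
   bar  foo
0   10  A_1
1   20  A_2
2   30  A_3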

excel shows alternate blank lines in csv files written using to_csv()

Call to_csv() with lineterminator="\n". For example

df.to_csv(path, index=False, lineterminator="\n")

tags | to_csv files have an extra line ending character at the end, to_csv line endings

Ref:- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

openpyxl vs. xlrd

context | read excel files, pd.read_excel

openpyxl supports the newer Excel file formats. xlrd supports old-style Excel files (.xls).

xlrd used to support .xlsx files at some point, but dropped that support starting with 2.0.0 (2020-12-11).

When calling pd.read_excel() with engine=None, waterfall logic is used to determine the engine. This logic is described in https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html .
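
To bypass the waterfall logic, pass the engine explicitly (a minimal sketch; the file names are hypothetical):

import pandas as pd
df_new = pd.read_excel('data.xlsx', engine='openpyxl')  # newer .xlsx format
df_old = pd.read_excel('legacy.xls', engine='xlrd')     # old-style .xls format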

Missing values

number of missing values in a dataframe

  • df.isna().sum().sum() - total number of missing values in the entire dataframe.
  • df.isna().sum() - number of missing values per column
  • df.isna() - boolean dataframe with True if the value is missing, False otherwise.
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [3, None, 5, 7], 'B': [np.nan, 5, None, 4]})
df
Out[1]:
     A    B
0  3.0  NaN
1  NaN  5.0
2  5.0  NaN
3  7.0  4.0

In [2]:
df.isna().sum().sum()
Out[2]:
3

In [3]:
df.isna().sum()
Out[3]:
A    1
B    2
dtype: int64

In [4]:
df.isna()
Out[4]:
       A      B
0  False   True
1   True  False
2  False   True
3  False  False

fill missing values using a lookup table

Q. Fill the NaNs in

                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN

using the lookup table

                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health

so that we get

                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health

Ans: Get the missing locations using isna() and then use map with set_index.

% ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'title': ['Executive Secretary', 'Administrative Officer',
              'Recruiting Manager', 'Senior Editor', 'Media Manager I'],
    'industry': ['Health', 'Financial Services', 'Property', np.nan, np.nan]})
df
Out[1]: 
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN

In [2]: 
lookup = pd.DataFrame({
    'title': ['Executive Secretary', 'Senior Editor', 'Media Manager I'],
    'industry': ['Retail', 'Manufacturing', 'Health']})
lookup
Out[2]: 
                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health

In [3]: 
missing = df['industry'].isna()

In [4]: 
df.loc[missing, 'industry'] = df.loc[missing, 'title'].map(
    lookup.set_index('title')['industry'])
df
Out[4]: 
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health

Ref:- https://stackoverflow.com/questions/64438066/how-can-i-fillna-based-on-the-columns-from-another-dataframe/ - original question. Here, I changed the column names for brevity.

Sample usage

read_csv usage

  • Read a csv file that has no headers and interpret all columns as strings
    pd.read_csv(input_file, dtype=str, sep='|', header=None)
  • drop duplicate lines
    pd.read_csv(input_file).drop_duplicates()

using astype

import numpy as np

df = df.astype({'col_a': np.int64,
                'col_b': np.int64})
df['col_a'] = df['col_a'].astype(np.int64)

stackoverflow questions I came across

normalize data within each group
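
A minimal sketch of one way to do this, using groupby().transform() (the column names and the scale-by-group-max normalization below are made up for illustration):

In [1]:
import pandas as pd
df = pd.DataFrame({'group': ['x', 'x', 'y', 'y'],
                   'value': [1.0, 3.0, 10.0, 40.0]})
df['normalized'] = df['value'] / df.groupby('group')['value'].transform('max')
df
Out[1]:
  group  value  normalized
0     x    1.0    0.333333
1     x    3.0    1.000000
2     y   10.0    0.250000
3     y   40.0    1.000000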

find many to one mappings

<place holder>

pivot and sum

convert pivot table output to dataframe

Sample code

import numpy as np

table = df[cols_of_interest].pivot_table(values=['amount'], index=['foo', 'bar'],
                                         columns=['class_name'], aggfunc=np.sum)
table.columns = table.columns.droplevel(0)    # drop 'amount' (the values level) from level 0
table.columns.name = None                     # remove 'class_name' as the columns' name
table = table.reset_index()                   # move the index levels back into columns

normalize data

How to convert

$ cat data.csv
node_id,sim_id,portfolio_risk_total
11,1,10
11,2,20
11,3,30
11,4,40
22,1,100
22,2,200
22,3,300
22,4,400
33,1,1000
33,2,2000
33,3,3000
33,4,4000

to

$ cat out.csv
sim_id,11,22,33
1,10,100,1000
2,20,200,2000
3,30,300,3000
4,40,400,4000

Answer:

The technical term for this is 'pivoting'. You are converting "stacked" or "record" format data into a normalized form. Below is how to do it. For more complex and related operations, see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html .

In [1]:
import pandas as pd
df = pd.read_csv('data.csv')
df
Out[1]:
        node_id  sim_id  portfolio_risk_total
0            11       1                    10
1            11       2                    20
2            11       3                    30
3            11       4                    40
4            22       1                   100
5            22       2                   200
6            22       3                   300
7            22       4                   400
8            33       1                  1000
9            33       2                  2000
10           33       3                  3000
11           33       4                  4000

In [2]:
table = df.pivot(index='sim_id', columns='node_id', values='portfolio_risk_total')
table
Out[2]:
node_id      11   22    33
sim_id
1            10  100  1000
2            20  200  2000
3            30  300  3000
4            40  400  4000

In [3]:
table.columns.name = None
table = table.reset_index()
table
Out[3]:
   sim_id  11   22    33
0       1  10  100  1000
1       2  20  200  2000
2       3  30  300  3000
3       4  40  400  4000

In [4]:
table.to_csv('out.csv', index=False)

user guide

URL:   https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
Title: Reshaping and pivot tables
Notes: tags: stacked or record format, denormalized to normalized data

difference between

What is the difference between read_table and read_csv?

                   read_csv  read_table
default separator  ","       "\t"

The code for both read_csv() and read_table() is in https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/readers.py . The two functions are identical except for the default separator.

The relevant code in read_csv():

    kwds_defaults = _refine_defaults_read(
        dialect,
        delimiter,
        engine,
        sep,
        on_bad_lines,
        names,
        defaults={"delimiter": ","},
        dtype_backend=dtype_backend,
    )
    kwds.update(kwds_defaults)

and in read_table():

    kwds_defaults = _refine_defaults_read(
        dialect,
        delimiter,
        engine,
        sep,
        on_bad_lines,
        names,
        defaults={"delimiter": "\t"},
        dtype_backend=dtype_backend,
    )
    kwds.update(kwds_defaults)

last checked | 2024-07-20
