Creating a dataframe
Create a dataframe from a list of lists
>>> import pandas as pd
>>> data = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
>>> df = pd.DataFrame(data)
>>> df
   0  1  2  3
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
You can also specify a list of columns for the dataframe.
>>> columns = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame(data, columns=columns)
>>> df
   a  b  c  d
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
To specify the index
>>> df = pd.DataFrame([[1, 2, 3, 4, 15, 6], ['1', '2', '3', '4', 'F']],
...                   index=['CT', 'NY'])
>>> df
    0  1  2  3   4    5
CT  1  2  3  4  15  6.0
NY  1  2  3  4   F  NaN
To make each list into a column, use zip
% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...:
   ...: # create lists
   ...: l1 = ["Amar", "Barsha", "Carlos", "Tanmay", "Misbah"]
   ...: l2 = ["Alpha", "Bravo", "Charlie", "Tango", "Mike"]
   ...: l3 = [23, 25, 22, 27, 29]
   ...: l4 = [69, 54, 73, 70, 74]
   ...:
   ...: # create the dataframe
   ...: team = pd.DataFrame(list(zip(l1, l2, l3, l4)),
   ...:                     columns=['Name', 'Code', 'Age', 'Weight'])
   ...: print(team)
     Name     Code  Age  Weight
0    Amar    Alpha   23      69
1  Barsha    Bravo   25      54
2  Carlos  Charlie   22      73
3  Tanmay    Tango   27      70
4  Misbah     Mike   29      74
Ref:-
- https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea of using zip from here.
tags | row by row
Create a data frame from two lists
On the interpreter
>>> import pandas as pd
>>> df = pd.DataFrame({'letters': ['a', 'b', 'c', 'd'], 'numbers': [1, 2, 3, 4]})
>>> df
  letters  numbers
0       a        1
1       b        2
2       c        3
3       d        4
In the code
% cat create_df_from_lists.py
import pandas as pd

letters = ['a', 'b', 'c']
words = ['apple', 'ball', 'cat']

df = pd.DataFrame({'letter': letters, 'word': words})

print("letters:\n", letters, "\n")
print("words:\n", words, "\n")
print("df:\n", df, "\n")
Sample run:
% python3 -u create_df_from_lists.py
letters:
 ['a', 'b', 'c']

words:
 ['apple', 'ball', 'cat']

df:
   letter   word
0      a  apple
1      b   ball
2      c    cat
Create a dataframe from a list of tuples
In [1]: import pandas as pd
   ...: data = [(0, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
   ...: df = pd.DataFrame(data)
   ...: df
Out[1]:
   0  1  2  3
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
You can also specify a list of columns for the dataframe.
In [2]: columns = ('a', 'b', 'c', 'd')
   ...: df = pd.DataFrame(data, columns=columns)
   ...: df
Out[2]:
   a  b  c  d
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
Create a dataframe using StringIO
In [1]: import pandas as pd
   ...: import io
   ...: data = """
   ...: Date,Open,High,Low,Close,Adj Close,Volume
   ...: 2024-01-02,472.16,473.67,470.49,472.65,469.67,123623700
   ...: 2024-01-03,470.43,471.19,468.17,468.79,465.84,103585900
   ...: 2024-01-04,468.3,470.96,467.05,467.28,464.33,84232200
   ...: 2024-01-05,467.49,470.44,466.43,467.92,464.97,86060800
   ...: 2024-01-08,468.43,474.75,468.3,474.6,471.61,74879100
   ...: """
   ...: df = pd.read_csv(io.StringIO(data))

In [2]: df
Out[2]:
         Date    Open    High     Low   Close  Adj Close     Volume
0  2024-01-02  472.16  473.67  470.49  472.65     469.67  123623700
1  2024-01-03  470.43  471.19  468.17  468.79     465.84  103585900
2  2024-01-04  468.30  470.96  467.05  467.28     464.33   84232200
3  2024-01-05  467.49  470.44  466.43  467.92     464.97   86060800
4  2024-01-08  468.43  474.75  468.30  474.60     471.61   74879100
Tested with Python 3.12.3, IPython 8.25.0
Create a dataframe by splitting strings
Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
Load the data into a pandas series
% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: strings = ['Netflix 100', 'Costco 200', 'Walmart 500', 'Costco 500']
   ...: s = pd.Series(strings)
   ...: print(s)
0    Netflix 100
1     Costco 200
2    Walmart 500
3     Costco 500
dtype: object
Create a dataframe by splitting the strings into multiple columns
In [2]: df = s.str.split(expand=True)
   ...: print(df)
         0    1
0  Netflix  100
1   Costco  200
2  Walmart  500
3   Costco  500
Name the columns.
str.split gives everything as strings. Change them to numbers as needed.
In [3]: df.columns = ['company', 'value']
   ...: df['value'] = df['value'].astype('float')
   ...: print(df)
   company  value
0  Netflix  100.0
1   Costco  200.0
2  Walmart  500.0
3   Costco  500.0
See also:
- I used this trick to answer a question on stackoverflow - https://stackoverflow.com/a/75212176/6305733
also demonstrates | assign column names to a dataframe
Create a dataframe from a series of lists
tags | convert series with lists to dataframe
df = pd.DataFrame(s.to_list())
For example
In [1]: import pandas as pd
   ...: s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
   ...: s
Out[1]:
0       [1, 2, 3]
1    [4, 5, 6, 7]
2          [8, 9]
dtype: object

In [2]: df = pd.DataFrame(s.to_list())
   ...: df
Out[2]:
   0  1    2    3
0  1  2  3.0  NaN
1  4  5  6.0  7.0
2  8  9  NaN  NaN
If each list has the same number of elements, np.vstack() can also be used; otherwise it will not work. For example
In [5]: s
Out[5]:
0       [1, 2, 3]
1    [4, 5, 6, 7]
2          [8, 9]
dtype: object

In [6]: import numpy as np
   ...: df = pd.DataFrame(np.vstack(s))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 2
      1 import numpy as np
----> 2 df = pd.DataFrame(np.vstack(s))

File <__array_function__ internals>:200, in vstack(*args, **kwargs)

File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
    294 if not isinstance(arrs, list):
    295     arrs = [arrs]
--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)

File <__array_function__ internals>:200, in concatenate(*args, **kwargs)

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
But with lists of equal length, both approaches work:
In [10]: s = pd.Series([[1, 2, 3], [4, 5, 6]])
    ...: s
Out[10]:
0    [1, 2, 3]
1    [4, 5, 6]
dtype: object

In [11]: import numpy as np
    ...: df = pd.DataFrame(np.vstack(s))
    ...: df
Out[11]:
   0  1  2
0  1  2  3
1  4  5  6

In [12]: df = pd.DataFrame(s.to_list())
    ...: df
Out[12]:
   0  1  2
0  1  2  3
1  4  5  6
Create a dataframe from a bunch of variables
import pandas as pd

df = pd.DataFrame({
    'key': ['var1', 'var2', 'var3'],
    'value': [var1, var2, var3]
})
For example
$ ipython

In [1]: year = 2023; month = 6; date = 15

In [2]: import pandas as pd
   ...: df = pd.DataFrame({
   ...:     'key': ['year', 'month', 'date'],
   ...:     'value': [year, month, date]
   ...: })

In [3]: df
Out[3]:
     key  value
0   year   2023
1  month      6
2   date     15

In [4]: df.dtypes
Out[4]:
key      object
value     int64
dtype: object
It works even if the variables are not of the same type.
In [5]: year = 2023; month = 'June'; date = 15

In [6]: df = pd.DataFrame({
   ...:     'key': ['year', 'month', 'date'],
   ...:     'value': [year, month, date]
   ...: })

In [7]: df
Out[7]:
     key value
0   year  2023
1  month  June
2   date    15

In [8]: df.dtypes
Out[8]:
key      object
value    object
dtype: object
Tested with Python 3.11.3, IPython 8.12.0
selection related
split columns
description | split a column into two columns based on whether the values are positive or negative, default value, longs and shorts
tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split]
lookup value
To pick the first value in column 'A' for rows where column B is FOO
df.loc[df['B'] == 'FOO', 'A'].iloc[0]
Example:
$ ipython

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
   ...: print(df)
    A  B
0  p1  1
1  p2  3
2  p3  3
3  p4  2

In [2]: df.loc[df['B'] == 3, 'A']
Out[2]:
1    p2
2    p3
Name: A, dtype: object

In [3]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[3]: 'p2'
search tags | value of one column when another column equals something
Ref:- https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-in-pandas
Series related
view the index, type, length and name of the series
s.index
s.dtype
s.size
s.name
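For example (a made-up series to illustrate):

In [1]: import pandas as pd
   ...: s = pd.Series([10, 20, 30], name='prices')

In [2]: s.index
Out[2]: RangeIndex(start=0, stop=3, step=1)

In [3]: s.dtype
Out[3]: dtype('int64')

In [4]: s.size
Out[4]: 3

In [5]: s.name
Out[5]: 'prices'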
unique types found in an object series
First, use the .apply method with the type function to get back a Series that has the type of every member. Then chain the .unique method onto the result.
s.apply(type).unique()
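For example (a made-up mixed-type series):

In [1]: import pandas as pd
   ...: s = pd.Series(['foo', 1, 2.5])
   ...: s.apply(type).unique()
Out[1]: array([<class 'str'>, <class 'int'>, <class 'float'>], dtype=object)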
a vs b
shape vs size
Shape gives the number of elements in each dimension. Size gives the total number of elements.
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

In [2]: df
Out[2]:
   col1  col2  col3
0     1     3     5
1     2     4     6

In [3]: df.shape
Out[3]: (2, 3)

In [4]: df.size
Out[4]: 6
using | pandas 1.1.2, python 3.8.18
Missing data
stringify nans
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a
0    foo
1    NaN
2    bar
3    NaN
dtype: object
>>> a.fillna('baz')
0    foo
1    baz
2    bar
3    baz
dtype: object
>>> a
0    foo
1    NaN
2    bar
3    NaN
dtype: object
replace a string with NaN
% python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a.isnull()
0    False
1     True
2    False
3     True
dtype: bool
>>> b = a.fillna('baz')
>>> b
0    foo
1    baz
2    bar
3    baz
dtype: object
>>> c = b.replace('baz', np.nan, regex=True)
>>> c
0    foo
1    NaN
2    bar
3    NaN
dtype: object
>>> c.isnull()
0    False
1     True
2    False
3     True
dtype: bool
demonstrates | how to change a string to NaN
See also:- https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas
use values in another column as a fallback
$ ipython

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'old': ['m', 'n', 'o', 'p'],
   ...:                    'new': ['a', None, 'b', None]})
   ...: df
Out[1]:
  old   new
0   m     a
1   n  None
2   o     b
3   p  None

In [2]: df['new'] = df['new'].fillna(df['old'])
   ...: df
Out[2]:
  old new
0   m   a
1   n   n
2   o   b
3   p   p
using | Python 3.11.8, pandas 2.1.4
Tasks
- tags | max_rows, max_columns, set_option, all rows, all columns
- tags | max_colwidth, set_option
- tags | pretty print dataframe
read_csv()
read all columns as strings
# read_csv preserves the white space. So strip it out.
df = pd.read_csv(file_name, dtype='str')\
       .rename(columns=lambda x: x.strip())\
       .apply(lambda x: x.str.strip())
[{demonstrates: strip spaces in the column names, strip all elements},
{tags: read_csv as all strings}]
check if
check if some columns exist in a dataframe
approach 1:
if set(['A', 'C']).issubset(df.columns):
    df['sum'] = df['A'] + df['C']
approach 2:
Instead of set([]), you can also use {} to create sets.
if {'A', 'C'}.issubset(df.columns):
    ...
approach 3:
You can also do this using a generator expression
if all(item in df.columns for item in ['A', 'C']):
    ...
Ref:
- https://stackoverflow.com/a/39371897 - where I found the answer
- talks about creating sets using {}
data transformation related
matches to coplays
Consider the dataframe
   match_id player_id
0         0         a
1         0         b
2         0         c
3         1         a
4         1         b
5         2         c
which shows the players who played in each match. For example, [a, b, c] played in match 0, [a, b] played in match 1, and c alone played in match 2.
We want to get a coplay count
  player_id1 player_id2  size
0          a          a     2
1          a          b     2
2          a          c     1
3          b          a     2
4          b          b     2
5          b          c     1
6          c          a     1
7          c          b     1
8          c          c     2
which shows
- the number of games each player played with another player when player_id1 $\neq$ player_id2 (e.g., a and b played in 2 games, a and c played in 1 game)
- the number of games a player played when player_id1 $=$ player_id2
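One way to compute this count is to self-merge on match_id, so that every player is paired with every co-player in the same match, and then count rows per pair. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'match_id': [0, 0, 0, 1, 1, 2],
                   'player_id': ['a', 'b', 'c', 'a', 'b', 'c']})

# Pair every player with every player in the same match (including themselves).
pairs = df.merge(df, on='match_id', suffixes=('1', '2'))

# Count the number of matches per (player_id1, player_id2) pair.
coplay = (pairs.groupby(['player_id1', 'player_id2'])
               .size()
               .reset_index(name='size'))
print(coplay)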
difference between
convert stuff
convert column names to lower case and replace space with underscore
df.columns = df.columns.str.lower().str.replace(' ', '_')
Example:
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['2023-03-02', '2023-02-28', 3.20, 3.30],
   ...:      ['2022-12-08', '2022-11-30', 3.14, 3.10]],
   ...:     columns=["Announcement Date", "Fiscal Quarter End",
   ...:              "Estimated EPS", "Actual EPS"])
   ...: df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]: df.columns = df.columns.str.lower().str.replace(' ', '_')

In [3]: df
Out[3]:
  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1
convert column names to lower case
Approach 1:
df.columns = df.columns.str.lower()
Approach 2:
df.rename(columns=lambda x: x.lower(), inplace=True)
Notes:
- I prefer approach 1 - simple syntax, easy to remember.
Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with the same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload the data into a Hadoop cluster since Hive is not case sensitive. As a workaround, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, …, suffixes=('_df1', '_df2')). The resulting data frame will then have foo_df1 and foo_df2 columns.
Example (using approach 1):
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['2023-03-02', '2023-02-28', 3.20, 3.30],
   ...:      ['2022-12-08', '2022-11-30', 3.14, 3.10]],
   ...:     columns=["Announcement Date", "Fiscal Quarter End",
   ...:              "Estimated EPS", "Actual EPS"])
   ...: df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]: df.columns = df.columns.str.lower()

In [3]: df
Out[3]:
  announcement date fiscal quarter end  estimated eps  actual eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1
Example (using approach 2):
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['2023-03-02', '2023-02-28', 3.20, 3.30],
   ...:      ['2022-12-08', '2022-11-30', 3.14, 3.10]],
   ...:     columns=["Announcement Date", "Fiscal Quarter End",
   ...:              "Estimated EPS", "Actual EPS"])
   ...: df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]: df.rename(columns=lambda x: x.lower(), inplace=True)

In [3]: df
Out[3]:
  announcement date fiscal quarter end  estimated eps  actual eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1
convert all categorical columns to string columns
df.loc[:, df.dtypes == "category"] = df.select_dtypes(
    ["category"]).apply(lambda x: x.astype("object"))
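For example, here is a minimal sketch using an equivalent, more explicit variant of the same idea (the frame is made up):

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': pd.Categorical(['x', 'y']), 'b': [1, 2]})
   ...: df.dtypes
Out[1]:
a    category
b       int64
dtype: object

In [2]: cat_cols = df.select_dtypes(['category']).columns
   ...: df[cat_cols] = df[cat_cols].astype('object')
   ...: df.dtypes
Out[2]:
a    object
b     int64
dtype: object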
convert all columns in a dataframe to string
df.astype(str)
Example:
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': [648, 435], 'b': [175, 389]})
   ...: print(df)
     a    b
0  648  175
1  435  389

In [2]: df.dtypes
Out[2]:
a    int64
b    int64
dtype: object

In [3]: df2 = df.astype(str)
   ...: df2.dtypes
Out[3]:
a    object
b    object
dtype: object

In [4]: print(df2)
     a    b
0  648  175
1  435  389
convert two columns of dataframe into a dictionary
df.set_index('id')['value'].to_dict()
Example
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'id': [0, 1, 2], 'value': [1.7, 5.2, 4.4]})
   ...: df
Out[1]:
   id  value
0   0    1.7
1   1    5.2
2   2    4.4

In [2]: df.set_index('id')['value'].to_dict()
Out[2]: {0: 1.7, 1: 5.2, 2: 4.4}
Ref:- https://stackoverflow.com/questions/18695605/python-pandas-dataframe-to-dictionary
convert % to numbers
tags | convert percent symbol in strings to numbers
In [2]: import pandas as pd
   ...: df = pd.DataFrame({'foo': ['1.8%', '-2.5%', '0.7%', '3.2%']})
   ...: df
Out[2]:
     foo
0   1.8%
1  -2.5%
2   0.7%
3   3.2%

In [3]: df['foo'] = df['foo'].str.rstrip('%').astype(float)/100

In [4]: df
Out[4]:
     foo
0  0.018
1 -0.025
2  0.007
3  0.032
Tested using python 3.10.9, ipython 8.8.0, and pandas 1.5.2
Ref:- https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv
dummy
json to dataframe
From | To | Use |
---|---|---|
JSON | DataFrame | pandas.read_json() |
DataFrame | JSON | DataFrame.to_json() |
JSON | Python object | json.loads() |
Python object | JSON | json.dumps() |
Sample code nbviewer/github/KamarajuKusumanchi
sum if columns exist
tags | sum if column exists
$ ipython

In [1]: import numpy as np
   ...: import pandas as pd
   ...: df = pd.DataFrame({'ID1': [15.3, 12.1, 13.2, 10.0, np.nan],
   ...:                    'ID2': [7.0, 7.7, np.nan, 11.3, np.nan],
   ...:                    'ID5': [10, 15, 3.1, 2.2, np.nan]})
   ...: df
Out[1]:
    ID1   ID2   ID5
0  15.3   7.0  10.0
1  12.1   7.7  15.0
2  13.2   NaN   3.1
3  10.0  11.3   2.2
4   NaN   NaN   NaN

In [2]: List_ID = ['ID1', 'ID2', 'ID3']
   ...: cols = df.columns[df.columns.isin(List_ID)]
   ...: cols
Out[2]: Index(['ID1', 'ID2'], dtype='object')

In [3]: res = df[cols].sum(axis=1)
   ...: res
Out[3]:
0    22.3
1    19.8
2    13.2
3    21.3
4     0.0
dtype: float64
Using Python 3.9.4, IPython 7.22.0, pandas 1.2.4, numpy 1.20.1
Ref:- https://stackoverflow.com/questions/38700848/adding-columns-if-they-exist-in-the-dataframe-pandas
get unique values in a series as another series
- Series.unique() returns a numpy.ndarray
- Series.drop_duplicates() returns a Series
- Series.drop_duplicates(keep='first') retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': [1, 2, 3, 3],
   ...:                    'b': ['foo', 'foo', 'bar', 'bar']})
   ...: df
Out[1]:
   a    b
0  1  foo
1  2  foo
2  3  bar
3  3  bar

In [2]: df['a'].unique()
Out[2]: array([1, 2, 3], dtype=int64)

In [3]: type(df['a'].unique())
Out[3]: numpy.ndarray

In [4]: df['a'].drop_duplicates()
Out[4]:
0    1
1    2
2    3
Name: a, dtype: int64

In [5]: type(df['a'].drop_duplicates())
Out[5]: pandas.core.series.Series

In [6]: df['a'].drop_duplicates(keep='last')
Out[6]:
0    1
1    2
3    3
Name: a, dtype: int64

In [7]: type(df['a'].drop_duplicates(keep='last'))
Out[7]: pandas.core.series.Series
Ref:- https://github.com/pandas-dev/pandas/issues/1923#issuecomment-398217427
clean up column names
# Strip out the white space from both ends of the column names
df.columns = df.columns.str.strip()
# Remove white space from both ends of the column names
# -> convert them to lower case
# -> replace space with an underscore
# -> remove open and close parenthesis
# (regex=False so that '(' and ')' are treated literally, not as regex)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')\
    .str.replace('(', '', regex=False).str.replace(')', '', regex=False)
round a single column
How to round a single column in pandas without affecting other columns? For example, given
df:
  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0
How to get
  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0
Solution:
df['value1'] = df['value1'].round()
Details:
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
   ...:                    'value1': [1.12, 1.50, 0.10, 3.30, 4.80],
   ...:                    'value2': [1.3, 2.5, 0.0, -1.0, -1.0]})
   ...: df
Out[1]:
  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0

In [2]: df['value1'] = df['value1'].round()
   ...: df
Out[2]:
  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0
relative percentage difference
Code:
% cat rel_pct_diff.py
import pandas as pd
import numpy as np

print('Using pandas', pd.__version__, ', numpy', np.__version__)

df = pd.DataFrame({'old': [2, 1, 0, 5, 0],
                   'new': [2.1, 1.1, 0.1, 4.9, 0]})
print('original df')
print(df)

df['rel_pct_diff1'] = ((df['new'] / df['old']) - 1) * 100
df['rel_pct_diff2'] = ((df['new'].divide(
    df['old'].where(df['old'] != 0, np.nan))) - 1) * 100
print(df)
Run:
% python ./rel_pct_diff.py
Using pandas 1.1.3 , numpy 1.19.1
original df
   old  new
0    2  2.1
1    1  1.1
2    0  0.1
3    5  4.9
4    0  0.0
   old  new  rel_pct_diff1  rel_pct_diff2
0    2  2.1            5.0            5.0
1    1  1.1           10.0           10.0
2    0  0.1            inf            NaN
3    5  4.9           -2.0           -2.0
4    0  0.0            NaN            NaN
how to do XYZ
- https://stackoverflow.com/questions/51705595/pandas-map-values-from-one-column-to-another-column - shows how to link the previous position of a vehicle with its current position. Solution uses groupby + shift + (bfill or fillna); a sketch of the idea follows.
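A sketch of that groupby + shift + fillna idea (column names and data are made up here):

import pandas as pd

# Made-up data: each row is a position reported by a vehicle, in time order.
df = pd.DataFrame({'vehicle': ['v1', 'v1', 'v2', 'v1', 'v2'],
                   'position': [10, 15, 7, 22, 9]})

# Previous position of the same vehicle; NaN for its first appearance.
df['prev_position'] = df.groupby('vehicle')['position'].shift()

# Fall back to the current position where there is no previous one.
df['prev_position'] = df['prev_position'].fillna(df['position'])
print(df)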
start column numbers from 1
By default, column numbers start from 0.
% ipython
Python 3.8.3 (default, Jul  2 2020, 16:21:59)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: n = np.ones(10).reshape(2, 5)
   ...: print(n)
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

In [2]: df = pd.DataFrame(n)
   ...: print(df)
     0    1    2    3    4
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0
To change them to start from 1
In [3]: df.columns = range(1, df.shape[1]+1)
   ...: print(df)
     1    2    3    4    5
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0
where df.shape[1] gives the number of columns in a dataframe.
In [4]: print(df.shape[1])
5
read everything as string
pd.read_csv(file_path, dtype=str)
Ref:- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
dollars two decimal places
Use float_format="%.2f"
df.to_csv(file_path, float_format="%.2f")
number of days between two YYYYMMDD string columns
tags | number of days between two dates, convert YYYYMMDD to date
a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
df['days'] = (a_dt - b_dt).dt.days
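For example (made-up dates):

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': ['20240131', '20240301'],
   ...:                    'b': ['20240101', '20240201']})
   ...: a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
   ...: b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
   ...: df['days'] = (a_dt - b_dt).dt.days
   ...: df
Out[1]:
          a         b  days
0  20240131  20240101    30
1  20240301  20240201    29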
See also:
- https://docs.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-ver15 - similar operation in sql. {tags | sql datediff dd}
Select columns from a dataframe only if they exist
df[df.columns.intersection(set(['list', 'of', 'cols']))]
Example:
$ ipython
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(columns=[1, 2, 3, 4])
   ...: df
Out[1]:
Empty DataFrame
Columns: [1, 2, 3, 4]
Index: []

In [2]: df[df.columns.intersection(set([1, 2, 2, 5]))]
Out[2]:
Empty DataFrame
Columns: [1, 2]
Index: []

In [3]: pd.__version__
Out[3]: '1.2.1'
See also:- https://stackoverflow.com/questions/43537166/select-columns-from-dataframe-on-condition-they-exist
migrate sql case when else statements to build a new column
Use np.select()
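For illustration, a minimal sketch (the column name and cutoffs are made up): np.select() checks the conditions in order, like the WHEN branches, and default plays the role of ELSE.

import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [95, 72, 55, 81]})

# SQL: CASE WHEN score >= 90 THEN 'A' WHEN score >= 70 THEN 'B' ELSE 'C' END
conditions = [df['score'] >= 90, df['score'] >= 70]
choices = ['A', 'B']
df['grade'] = np.select(conditions, choices, default='C')
print(df)
#    score grade
# 0     95     A
# 1     72     B
# 2     55     C
# 3     81     B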
Ref:-
- https://stackoverflow.com/questions/49228596/pandas-case-when-default-in-pandas - contains an example and other possible approaches
initialize column with strings that range from 1 to N
To set it to '1', '2', …, 'N'
df['foo'] = [str(x) for x in range(1, 1 + df.shape[0])]
To set it to 'A_1', 'A_2', …, 'A_N'
df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]
excel shows alternate blank lines in csv files written using to_csv()
Call to_csv() with lineterminator="\n". For example
df.to_csv(path, index=False, lineterminator="\n")
tags | to_csv files have an extra line ending character at the end, to_csv line endings
Ref:- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
openpyxl vs. xlrd
context | read excel files, pd.read_excel
openpyxl supports newer Excel file formats. xlrd supports old-style Excel files (.xls).
xlrd used to support .xlsx files at some point, but dropped that support in 2.0.0 (2020-12-11).
When calling pd.read_excel(), if engine=None then a waterfall logic is used to determine the engine. This logic is described in https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html .
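To bypass the waterfall logic, the engine can also be pinned explicitly. A minimal sketch (file names are hypothetical):

import pandas as pd

df_new = pd.read_excel('data.xlsx', engine='openpyxl')  # newer .xlsx format
df_old = pd.read_excel('data.xls', engine='xlrd')       # old-style .xls format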
Ref:
- This site contains pointers to the best information available about working with Excel files in the Python programming language.
Missing values
number of missing values in a dataframe
- df.isna().sum().sum() - total number of missing values in the entire dataframe.
- df.isna().sum() - number of missing values per column
- df.isna() - boolean dataframe with True if the value is missing, False otherwise.
In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({'A': [3, None, 5, 7], 'B': [np.nan, 5, None, 4]})
   ...: df
Out[1]:
     A    B
0  3.0  NaN
1  NaN  5.0
2  5.0  NaN
3  7.0  4.0

In [2]: df.isna().sum().sum()
Out[2]: 3

In [3]: df.isna().sum()
Out[3]:
A    1
B    2
dtype: int64

In [4]: df.isna()
Out[4]:
       A      B
0  False   True
1   True  False
2  False   True
3  False  False
fill missing values using a lookup table
Q. Fill the NaNs in
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN
using the lookup table
                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health
so that we get
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health
Ans: Get the missing locations using isna() and then use map with set_index.
% ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({
   ...:     'title': ['Executive Secretary', 'Administrative Officer',
   ...:               'Recruiting Manager', 'Senior Editor', 'Media Manager I'],
   ...:     'industry': ['Health', 'Financial Services', 'Property',
   ...:                  np.nan, np.nan]})
   ...: df
Out[1]:
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN

In [2]: lookup = pd.DataFrame({
   ...:     'title': ['Executive Secretary', 'Senior Editor', 'Media Manager I'],
   ...:     'industry': ['Retail', 'Manufacturing', 'Health']})
   ...: lookup
Out[2]:
                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health

In [3]: missing = df['industry'].isna()

In [4]: df.loc[missing, 'industry'] = df.loc[missing, 'title'].map(
   ...:     lookup.set_index('title')['industry'])
   ...: df
Out[4]:
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health
Ref:- https://stackoverflow.com/questions/64438066/how-can-i-fillna-based-on-the-columns-from-another-dataframe/ - original question. Here, I changed the column names for brevity.
Sample usage
read_csv usage
- Read a csv file that has no headers and interpret all columns as strings
pd.read_csv(input_file, dtype=str, sep='|', header=None)
- drop duplicate lines
pd.read_csv(input_file).drop_duplicates()
using astype
df = df.astype({'col_a': np.int64, 'col_b': np.int64})
df['col_a'] = df['col_a'].astype(np.int64)
stackoverflow questions I came across
normalize data within each group
find many to one mappings
links
<place holder>
pivot related
pivot and sum
call pivot_table() with aggfunc=np.sum
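For example (a made-up frame):

In [1]: import numpy as np
   ...: import pandas as pd
   ...: df = pd.DataFrame({'group': ['x', 'x', 'x', 'y'],
   ...:                    'kind': ['a', 'b', 'a', 'b'],
   ...:                    'amount': [1, 2, 3, 4]})
   ...: df.pivot_table(values='amount', index='group',
   ...:                columns='kind', aggfunc=np.sum)
Out[1]:
kind     a    b
group
x      4.0  2.0
y      NaN  4.0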
convert pivot table output to dataframe
Sample code
table = df[cols_of_interest].pivot_table(
    values=['amount'], index=['foo', 'bar'],
    columns=['class_name'], aggfunc=np.sum)
table.columns = table.columns.droplevel(0)  # remove 'amount' from level 0 of the columns
table.columns.name = None                   # remove the columns index name
table = table.reset_index()                 # change the index back to columns
Ref:-
- https://stackoverflow.com/questions/43756052/transform-pandas-pivot-table-to-regular-dataframe - where I found the answer
normalize data
How to convert
$ cat data.csv
node_id,sim_id,portfolio_risk_total
11,1,10
11,2,20
11,3,30
11,4,40
22,1,100
22,2,200
22,3,300
22,4,400
33,1,1000
33,2,2000
33,3,3000
33,4,4000
to
$ cat out.csv
sim_id,11,22,33
1,10,100,1000
2,20,200,2000
3,30,300,3000
4,40,400,4000
Answer:
The technical term for this is ‘pivoting’. You are converting “stacked” or “record” format data into a normalized form. Below is how to do it. For more complex and related operations, see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html .
In [1]: import pandas as pd
   ...: df = pd.read_csv('data.csv')
   ...: df
Out[1]:
    node_id  sim_id  portfolio_risk_total
0        11       1                    10
1        11       2                    20
2        11       3                    30
3        11       4                    40
4        22       1                   100
5        22       2                   200
6        22       3                   300
7        22       4                   400
8        33       1                  1000
9        33       2                  2000
10       33       3                  3000
11       33       4                  4000

In [2]: table = df.pivot(index='sim_id', columns='node_id',
   ...:                  values='portfolio_risk_total')
   ...: table
Out[2]:
node_id  11   22    33
sim_id
1        10  100  1000
2        20  200  2000
3        30  300  3000
4        40  400  4000

In [3]: table.columns.name = None
   ...: table = table.reset_index()
   ...: table
Out[3]:
   sim_id  11   22    33
0       1  10  100  1000
1       2  20  200  2000
2       3  30  300  3000
3       4  40  400  4000

In [4]: table.to_csv('out.csv', index=False)
pandas documentation links I came across
dummy
function | url | notes |
---|---|---|
pandas.Series.str.split | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html | |
pandas.Series.str.join | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.join.html#pandas.Series.str.join | |
function | url | notes |
---|---|---|
pandas.DataFrame.fillna | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html | |
pandas.DataFrame.replace | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html | |
pandas.DataFrame.rename | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html | |
pandas.DataFrame.notna | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html | |
user guide
url | title | notes |
---|---|---|
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html | Reshaping and pivot tables | tags: stacked or record format, denormalized to normalized data |
difference between
What is the difference between read_table and read_csv?
 | read_csv | read_table |
---|---|---|
default separator | "," | "\t" |
The code for read_csv() and read_table() is in https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/readers.py . The two functions are identical except for the default separator.
Both functions call _refine_defaults_read() with identical arguments; only the defaults differ.

read_csv:

kwds_defaults = _refine_defaults_read(
    dialect,
    delimiter,
    engine,
    sep,
    on_bad_lines,
    names,
    defaults={"delimiter": ","},
    dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)

read_table: the same code, but with defaults={"delimiter": "\t"}.
last checked | 2024-07-20