===== Creating a dataframe =====
==== Create a dataframe from list of lists ====
>>> import pandas as pd
>>> data = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
>>> df = pd.DataFrame(data)
>>> df
0 1 2 3
0 0 1 0 1
1 1 0 1 1
2 0 1 1 1
You can also specify a list of columns for the dataframe.
>>> columns = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame(data, columns=columns)
>>> df
a b c d
0 0 1 0 1
1 1 0 1 1
2 0 1 1 1
To specify the index
>>> df = pd.DataFrame([[1, 2, 3, 4, 15, 6], ['1', '2', '3', '4', 'F']], index=['CT', 'NY'])
>>> df
0 1 2 3 4 5
CT 1 2 3 4 15 6.0
NY 1 2 3 4 F NaN
To make each list into a column, use zip
% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
# create lists
l1 =["Amar", "Barsha", "Carlos", "Tanmay", "Misbah"]
l2 =["Alpha", "Bravo", "Charlie", "Tango", "Mike"]
l3 =[23, 25, 22, 27, 29]
l4 =[69, 54, 73, 70, 74]
# create the dataframe
team = pd.DataFrame(list(zip(l1, l2, l3, l4)), columns=['Name', 'Code', 'Age', 'Weight'])
print(team)
Name Code Age Weight
0 Amar Alpha 23 69
1 Barsha Bravo 25 54
2 Carlos Charlie 22 73
3 Tanmay Tango 27 70
4 Misbah Mike 29 74
Ref:-
* https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea on zip from here.
tags | row by row
==== Create a data frame from two lists ====
On the interpreter
>>> import pandas as pd
>>> df = pd.DataFrame({'letters': ['a', 'b', 'c', 'd'], 'numbers': [1,2,3,4]})
>>> df
letters numbers
0 a 1
1 b 2
2 c 3
3 d 4
In the code
% cat create_df_from_lists.py
import pandas as pd
letters = ['a', 'b', 'c']
words = ['apple', 'ball', 'cat']
df = pd.DataFrame(
    {'letter': letters,
     'word': words})
print("letters:\n", letters, "\n")
print("words:\n", words, "\n")
print("df:\n", df, "\n")
Sample run:
% python3 -u create_df_from_lists.py
letters:
['a', 'b', 'c']
words:
['apple', 'ball', 'cat']
df:
letter word
0 a apple
1 b ball
2 c cat
==== Create a dataframe from a list of tuples ====
In [1]:
import pandas as pd
data = [(0, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
df = pd.DataFrame(data)
df
Out[1]:
0 1 2 3
0 0 1 0 1
1 1 0 1 1
2 0 1 1 1
You can also specify a list of columns for the dataframe.
In [2]:
columns=('a', 'b', 'c', 'd')
df = pd.DataFrame(data, columns=columns)
df
Out[2]:
a b c d
0 0 1 0 1
1 1 0 1 1
2 0 1 1 1
==== Create a dataframe using StringIO ====
In [1]:
import pandas as pd
import io
data = """
Date,Open,High,Low,Close,Adj Close,Volume
2024-01-02,472.16,473.67,470.49,472.65,469.67,123623700
2024-01-03,470.43,471.19,468.17,468.79,465.84,103585900
2024-01-04,468.3,470.96,467.05,467.28,464.33,84232200
2024-01-05,467.49,470.44,466.43,467.92,464.97,86060800
2024-01-08,468.43,474.75,468.3,474.6,471.61,74879100
"""
df = pd.read_csv(io.StringIO(data))
In [2]:
df
Out[2]:
Date Open High Low Close Adj Close Volume
0 2024-01-02 472.16 473.67 470.49 472.65 469.67 123623700
1 2024-01-03 470.43 471.19 468.17 468.79 465.84 103585900
2 2024-01-04 468.30 470.96 467.05 467.28 464.33 84232200
3 2024-01-05 467.49 470.44 466.43 467.92 464.97 86060800
4 2024-01-08 468.43 474.75 468.30 474.60 471.61 74879100
Tested with Python 3.12.3, IPython 8.25.0
==== Create a dataframe by splitting strings ====
Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
Load the data into a pandas series
% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
strings = ['Netflix 100', 'Costco 200', 'Walmart 500', 'Costco 500']
s = pd.Series(strings)
print(s)
0 Netflix 100
1 Costco 200
2 Walmart 500
3 Costco 500
dtype: object
Create a dataframe by splitting the strings into multiple columns
In [2]:
df = s.str.split(expand=True)
print(df)
0 1
0 Netflix 100
1 Costco 200
2 Walmart 500
3 Costco 500
Name the columns.
str.split gives everything as strings. Change them to numbers as needed.
In [3]:
df.columns = ['company', 'value']
df['value'] = df['value'].astype('float')
print(df)
company value
0 Netflix 100.0
1 Costco 200.0
2 Walmart 500.0
3 Costco 500.0
See also:
* I used this trick to answer a question on stackoverflow - https://stackoverflow.com/a/75212176/6305733
* https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
also demonstrates | assign column names to a dataframe
==== Create a dataframe from a series of lists ====
tags | convert series with lists to dataframe
df = pd.DataFrame(s.to_list())
For example
In [1]:
import pandas as pd
s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
s
Out[1]:
0 [1, 2, 3]
1 [4, 5, 6, 7]
2 [8, 9]
dtype: object
In [2]:
df = pd.DataFrame(s.to_list())
df
Out[2]:
0 1 2 3
0 1 2 3.0 NaN
1 4 5 6.0 7.0
2 8 9 NaN NaN
If every list has the same number of elements, np.vstack() can also be used; otherwise it raises a ValueError. For example
In [5]:
s
Out[5]:
0 [1, 2, 3]
1 [4, 5, 6, 7]
2 [8, 9]
dtype: object
In [6]:
import numpy as np
df = pd.DataFrame(np.vstack(s))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[6], line 2
1 import numpy as np
----> 2 df = pd.DataFrame(np.vstack(s))
File <__array_function__ internals>:200, in vstack(*args, **kwargs)
File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
294 if not isinstance(arrs, list):
295 arrs = [arrs]
--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
File <__array_function__ internals>:200, in concatenate(*args, **kwargs)
ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
But with equal-length lists it works:
In [10]:
s = pd.Series([[1, 2, 3], [4, 5, 6]])
s
Out[10]:
0 [1, 2, 3]
1 [4, 5, 6]
dtype: object
In [11]:
import numpy as np
df = pd.DataFrame(np.vstack(s))
df
Out[11]:
0 1 2
0 1 2 3
1 4 5 6
In [12]:
df = pd.DataFrame(s.to_list())
df
Out[12]:
0 1 2
0 1 2 3
1 4 5 6
See also:
* https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe
==== Create a dataframe from a bunch of variables ====
import pandas as pd
df = pd.DataFrame({
    'key': ['var1', 'var2', 'var3'],
    'value': [var1, var2, var3]
})
For example
$ ipython
In [1]:
year = 2023; month = 6; date = 15
In [2]:
import pandas as pd
df = pd.DataFrame({
    'key': ['year', 'month', 'date'],
    'value': [year, month, date]
})
In [3]:
df
Out[3]:
key value
0 year 2023
1 month 6
2 date 15
In [4]:
df.dtypes
Out[4]:
key object
value int64
dtype: object
It works even if the variables are not of the same type.
In [5]:
year = 2023; month = 'June'; date = 15
In [6]:
df = pd.DataFrame({
    'key': ['year', 'month', 'date'],
    'value': [year, month, date]
})
In [7]:
df
Out[7]:
key value
0 year 2023
1 month June
2 date 15
In [8]:
df.dtypes
Out[8]:
key object
value object
dtype: object
Tested with Python 3.11.3, IPython 8.12.0
===== selection related =====
==== split columns ====
* [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/Separate%20positive%20and%20negative%20values.ipynb | Separate positive and negative values (nbviewer.jupyter.org/github/KamarajuKusumanchi)]]
description | split a column into two columns based on whether the values are positive or negative, default value, longs and shorts
* [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/dataframe%20split%20column%20string%20into%20multiple%20columns.ipynb | Split string column into multiple columns (nbviewer.jupyter.org/github/KamarajuKusumanchi)]]
tags | uses [[http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html | pandas.Series.str.split]]
==== lookup value ====
To pick the first value in column 'A' for rows where column B is FOO
df.loc[df['B'] == 'FOO', 'A'].iloc[0]
Example:
$ ipython
In [1]:
import pandas as pd
df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
print(df)
A B
0 p1 1
1 p2 3
2 p3 3
3 p4 2
In [2]:
df.loc[df['B'] == 3, 'A']
Out[2]:
1 p2
2 p3
Name: A, dtype: object
In [3]:
df.loc[df['B'] == 3, 'A'].iloc[0]
Out[3]:
'p2'
search tags | value of one column when another column equals something
Ref:- https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-in-pandas
===== Series related =====
==== view the index, type, length and name of the series ====
s.index
s.dtype
s.size
s.name
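For example, a minimal sketch with made-up data:
import pandas as pd

s = pd.Series([10, 20, 30], name='prices')
print(s.index)   # RangeIndex(start=0, stop=3, step=1)
print(s.dtype)   # int64
print(s.size)    # 3
print(s.name)    # prices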
==== unique types found in a object series ====
First, use the ''.apply'' method with the ''type'' function to get back a Series that has the type of every member. Then, //chain// the ''.unique'' method onto the result.
s.apply(type).unique()
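For example, a minimal sketch:
import pandas as pd

s = pd.Series(['foo', 1, 2.5, 'bar'])
print(s.apply(type).unique())
# [<class 'str'> <class 'int'> <class 'float'>]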
===== a vs b =====
==== shape vs size ====
Shape gives the number of elements in each dimension. Size gives the total number of elements.
In [1]:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                   'col3': [5, 6]})
In [2]:
df
Out[2]:
col1 col2 col3
0 1 3 5
1 2 4 6
In [3]:
df.shape
Out[3]:
(2, 3)
In [4]:
df.size
Out[4]:
6
using | pandas 1.1.2, python 3.8.18
Ref:-
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.size.html
See also:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html
===== Missing data =====
==== stringify nans ====
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a
0 foo
1 NaN
2 bar
3 NaN
dtype: object
>>> a.fillna('baz')
0 foo
1 baz
2 bar
3 baz
dtype: object
Note that fillna() returns a new Series; the original is unchanged.
>>> a
0 foo
1 NaN
2 bar
3 NaN
dtype: object
==== replace a string with NaN ====
% python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a.isnull()
0 False
1 True
2 False
3 True
dtype: bool
>>> b = a.fillna('baz')
>>> b
0 foo
1 baz
2 bar
3 baz
dtype: object
>>> c = b.replace('baz', np.nan, regex=True)
>>> c
0 foo
1 NaN
2 bar
3 NaN
dtype: object
>>> c.isnull()
0 False
1 True
2 False
3 True
dtype: bool
demonstrates | how to change a string to NaN
See also:- https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas
==== use values in another column as a fallback ====
$ ipython
In [1]:
import pandas as pd
df = pd.DataFrame({'old': ['m', 'n', 'o', 'p'], 'new': ['a', None, 'b', None]})
df
Out[1]:
old new
0 m a
1 n None
2 o b
3 p None
In [2]:
df['new'] = df['new'].fillna(df['old'])
df
Out[2]:
old new
0 m a
1 n n
2 o b
3 p p
using | Python 3.11.8, pandas 2.1.4
===== Tasks =====
* [[extract last field from each row of a column]]
* [[map one column to another]]
* [[show all the rows and columns]]
* tags | max_rows, max_columns, set_option, all rows, all columns
* [[print all characters in a cell]]
* tags | max_colwidth, set_option
* [[set index from 0 to N]]
* [[print dataframe without index]]
* tags | pretty print dataframe
* [[Add dates]]
* [[Add commas]]
* [[Get first element of series if not empty]]
* [[Order columns alphabetically]]
* [[Get the first non null value in each column]]
* [[Convert a dictionary of dataframes to a big dataframe]]
* [[Convert string to date]]
* [[print hundredths]]
* [[swap columns]]
===== read_csv() =====
==== read all columns as strings ====
# read_csv preserves the white space. So strip it out.
df = pd.read_csv(file_name, dtype='str')\
       .rename(columns=lambda x: x.strip())\
       .apply(lambda x: x.str.strip())
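A self-contained sketch of the same pattern, using io.StringIO in place of a file (the data here is made up):
import io
import pandas as pd

data = " id , name \n 1 , alice \n 2 , bob "
df = pd.read_csv(io.StringIO(data), dtype='str')\
       .rename(columns=lambda x: x.strip())\
       .apply(lambda x: x.str.strip())
print(df)
#   id   name
# 0  1  alice
# 1  2    bob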
[{demonstrates: strip spaces in the column names, strip all elements},\\
{tags: read_csv as all strings}]
===== check if =====
==== check if some columns exist in a dataframe ====
approach 1:
if set(['A', 'C']).issubset(df.columns):
    df['sum'] = df['A'] + df['C']
approach 2:
Instead of ''set([...])'', you can also use a set literal such as ''{'A', 'C'}''
if {'A', 'C'}.issubset(df.columns):
    ...
approach 3:
You can also do this using a generator comprehension
if all(item in df.columns for item in ['A', 'C']):
    ...
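A quick demonstration of approach 2, on a made-up dataframe:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
if {'A', 'C'}.issubset(df.columns):
    df['sum'] = df['A'] + df['C']
print(df)
#    A  B  C  sum
# 0  1  3  5    6
# 1  2  4  6    8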
Ref:
* https://stackoverflow.com/a/39371897 - where I found the answer
* https://docs.python.org/3/library/stdtypes.html#frozenset.issubset
* https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset
* talks about creating sets using ''{}''
* https://docs.python.org/3.13/reference/expressions.html#set-displays
===== data transformation related =====
==== matches to coplays ====
Consider the dataframe
match_id player_id
0 0 a
1 0 b
2 0 c
3 1 a
4 1 b
5 2 c
which shows which players played in each match. For example, [a, b, c] played in match 0, [a, b] played in match 1, and c alone played in match 2.
We want to get a coplay count
player_id1 player_id2 size
0 a a 2
1 a b 2
2 a c 1
3 b a 2
4 b b 2
5 b c 1
6 c a 1
7 c b 1
8 c c 2
which shows
* the number of games each player played with another player when player_id1 $ \neq $ player_id2 (ex:- a and b played in 2 games, a and c played in 1 game)
* the number of games a player played if player_id1 $ == $ player_id2
[[ matches to coplays | Solution ]]
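For reference, here is a minimal sketch of one way to do it (a self-merge on match_id followed by a groupby; the linked page has the full solution):
import pandas as pd

df = pd.DataFrame({'match_id': [0, 0, 0, 1, 1, 2],
                   'player_id': ['a', 'b', 'c', 'a', 'b', 'c']})
# pair up every player with every player in the same match (including themselves)
pairs = df.merge(df, on='match_id', suffixes=('1', '2'))
coplays = (pairs.groupby(['player_id1', 'player_id2'])
                .size().reset_index(name='size'))
print(coplays)
# reproduces the coplay count table shown above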
===== difference between =====
* [[astype vs. to_numeric]]
===== convert stuff =====
==== convert column names to lower case and replace space with underscore ====
df.columns = df.columns.str.lower().str.replace(' ', '_')
Example:
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
df = pd.DataFrame(
    [['2023-03-02', '2023-02-28', 3.20, 3.30],
     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
    columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
df
Out[1]:
Announcement Date Fiscal Quarter End Estimated EPS Actual EPS
0 2023-03-02 2023-02-28 3.20 3.3
1 2022-12-08 2022-11-30 3.14 3.1
In [2]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
In [3]:
df
Out[3]:
announcement_date fiscal_quarter_end estimated_eps actual_eps
0 2023-03-02 2023-02-28 3.20 3.3
1 2022-12-08 2022-11-30 3.14 3.1
==== convert column names to lower case ====
Approach 1:
df.columns = df.columns.str.lower()
Approach 2:
df.rename(columns=lambda x: x.lower(), inplace=True)
Notes:
* I prefer approach 1 - simple syntax, easy to remember.
Use case: While merging data from two dataframes using DataFrame.merge(), I ended up with two columns that had the same name but differed in case (ex: foo from df1, FOO from df2). This caused problems when I uploaded the data into a hadoop cluster, since hive is not case sensitive. As a workaround, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, ..., suffixes=('_df1', '_df2')). The resulting dataframe then has foo_df1 and foo_df2 columns.
Example (using approach 1):
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
df = pd.DataFrame(
    [['2023-03-02', '2023-02-28', 3.20, 3.30],
     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
    columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
df
Out[1]:
Announcement Date Fiscal Quarter End Estimated EPS Actual EPS
0 2023-03-02 2023-02-28 3.20 3.3
1 2022-12-08 2022-11-30 3.14 3.1
In [2]:
df.columns = df.columns.str.lower()
In [3]:
df
Out[3]:
announcement date fiscal quarter end estimated eps actual eps
0 2023-03-02 2023-02-28 3.20 3.3
1 2022-12-08 2022-11-30 3.14 3.1
Example (using approach 2):
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
df = pd.DataFrame(
    [['2023-03-02', '2023-02-28', 3.20, 3.30],
     ['2022-12-08', '2022-11-30', 3.14, 3.10]],
    columns=["Announcement Date", "Fiscal Quarter End", "Estimated EPS", "Actual EPS"])
df
Out[1]:
Announcement Date Fiscal Quarter End Estimated EPS Actual EPS
0 2023-03-02 2023-02-28 3.20 3.3
1 2022-12-08 2022-11-30 3.14 3.1
In [2]:
df.rename(columns=lambda x: x.lower(), inplace=True)
In [3]:
df
Out[3]:
announcement date fiscal quarter end estimated eps actual eps
0 2023-03-02 2023-02-28 3.20 3.3
1 2022-12-08 2022-11-30 3.14 3.1
==== convert all categorical columns to string columns ====
df.loc[:, df.dtypes == "category"] = df.select_dtypes(
["category"]
).apply(lambda x: x.astype("object"))
Ref:- https://stackoverflow.com/a/56944992
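An alternative sketch that rebuilds the dtypes with astype() instead of assigning through .loc (assuming a reasonably recent pandas):
import pandas as pd

df = pd.DataFrame({'a': pd.Categorical(['x', 'y']), 'b': [1, 2]})
# build a dtype mapping only for the categorical columns
mapping = {col: 'object' for col in df.select_dtypes(['category']).columns}
df = df.astype(mapping)
print(df.dtypes)
# a    object
# b     int64
# dtype: object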
==== convert all columns in a dataframe to string ====
df.astype(str)
Example:
In [1]: import pandas as pd
...: df = pd.DataFrame({'a': [648, 435], 'b': [175, 389]})
...: print(df)
a b
0 648 175
1 435 389
In [2]: df.dtypes
Out[2]:
a int64
b int64
dtype: object
In [3]: df2 = df.astype(str)
...: df2.dtypes
Out[3]:
a object
b object
dtype: object
In [4]: print(df2)
a b
0 648 175
1 435 389
==== convert two columns of dataframe into a dictionary ====
df.set_index('id')['value'].to_dict()
Example
In [1]:
import pandas as pd
df = pd.DataFrame({'id': [0, 1, 2], 'value': [1.7, 5.2, 4.4]})
df
Out[1]:
id value
0 0 1.7
1 1 5.2
2 2 4.4
In [2]:
df.set_index('id')['value'].to_dict()
Out[2]:
{0: 1.7, 1: 5.2, 2: 4.4}
Ref:- https://stackoverflow.com/questions/18695605/python-pandas-dataframe-to-dictionary
==== convert % to numbers ====
tags | convert percent symbol in strings to numbers
In [2]:
import pandas as pd
df = pd.DataFrame({'foo': ['1.8%', '-2.5%', '0.7%', '3.2%']})
df
Out[2]:
foo
0 1.8%
1 -2.5%
2 0.7%
3 3.2%
In [3]:
df['foo'] = df['foo'].str.rstrip('%').astype(float)/100
In [4]:
df
Out[4]:
foo
0 0.018
1 -0.025
2 0.007
3 0.032
Tested using python 3.10.9, ipython 8.8.0, and pandas 1.5.2
Ref:- https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv
===== dummy =====
==== json to dataframe ====
^ From ^ To ^ Use ^
| JSON | DataFrame | pandas.read_json() |
| DataFrame | JSON | df.to_json() or df.to_json(orient='records') |
| JSON | Python object | json.loads() |
| Python object | JSON | json.dumps() |
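A small round-trip sketch (made-up records; io.StringIO is used because passing a literal JSON string to read_json is deprecated in recent pandas):
import io
import json
import pandas as pd

records = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
df = pd.read_json(io.StringIO(records))  # JSON -> DataFrame
print(df.to_json(orient='records'))      # DataFrame -> JSON
print(json.loads(records))               # JSON -> Python object
print(json.dumps([{'a': 1}]))            # Python object -> JSON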
Sample code [[https://nbviewer.jupyter.org/github/KamarajuKusumanchi/notebooks/blob/master/pandas/json%20to%20dataframe.ipynb | nbviewer/github/KamarajuKusumanchi]]
Ref:-
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
==== sum if columns exist ====
tags | sum if column exists
$ ipython
In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID1': [15.3, 12.1, 13.2, 10.0, np.nan],
                   'ID2': [7.0, 7.7, np.nan, 11.3, np.nan],
                   'ID5': [10, 15, 3.1, 2.2, np.nan]})
df
Out[1]:
ID1 ID2 ID5
0 15.3 7.0 10.0
1 12.1 7.7 15.0
2 13.2 NaN 3.1
3 10.0 11.3 2.2
4 NaN NaN NaN
In [2]:
List_ID = ['ID1','ID2','ID3']
cols = df.columns[df.columns.isin(List_ID)]
cols
Out[2]:
Index(['ID1', 'ID2'], dtype='object')
In [3]:
res = df[cols].sum(axis=1)
res
Out[3]:
0 22.3
1 19.8
2 13.2
3 21.3
4 0.0
dtype: float64
Using Python 3.9.4, IPython 7.22.0, pandas 1.2.4, numpy 1.20.1
Ref:- https://stackoverflow.com/questions/38700848/adding-columns-if-they-exist-in-the-dataframe-pandas
==== get unique values in a series as another series ====
* Series.unique() returns a numpy.ndarray
* Series.drop_duplicates() returns a Series
* Series.drop_duplicates(keep='first') retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.
In [1]:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 3],
    'b': ['foo', 'foo', 'bar', 'bar']})
df
Out[1]:
a b
0 1 foo
1 2 foo
2 3 bar
3 3 bar
In [2]:
df['a'].unique()
Out[2]:
array([1, 2, 3], dtype=int64)
In [3]:
type(df['a'].unique())
Out[3]:
numpy.ndarray
In [4]:
df['a'].drop_duplicates()
Out[4]:
0 1
1 2
2 3
Name: a, dtype: int64
In [5]:
type(df['a'].drop_duplicates())
Out[5]:
pandas.core.series.Series
In [6]:
df['a'].drop_duplicates(keep='last')
Out[6]:
0 1
1 2
3 3
Name: a, dtype: int64
In [7]:
type(df['a'].drop_duplicates(keep='last'))
Out[7]:
pandas.core.series.Series
Ref:- https://github.com/pandas-dev/pandas/issues/1923#issuecomment-398217427
==== clean up column names ====
# Strip out the white space from both ends of the column names
df.columns = df.columns.str.strip()
# Remove white space from both ends of the column names
# -> convert them to lower case
# -> replace space with an underscore
# -> remove open and close parentheses
#    ('(' is a regex metacharacter, so pass regex=False)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '', regex=False).str.replace(')', '', regex=False)
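For example (hypothetical column names):
import pandas as pd

df = pd.DataFrame(columns=['  Net Sales ($M) ', 'Growth (%)'])
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '', regex=False).str.replace(')', '', regex=False)
print(df.columns.tolist())
# ['net_sales_$m', 'growth_%']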
==== round a single column ====
How to round a single column in pandas without affecting other columns? For example, given
df:
item value1 value2
0 a 1.12 1.3
1 a 1.50 2.5
2 a 0.10 0.0
3 b 3.30 -1.0
4 b 4.80 -1.0
How to get
item value1 value2
0 a 1.0 1.3
1 a 2.0 2.5
2 a 0.0 0.0
3 b 3.0 -1.0
4 b 5.0 -1.0
Solution:
df['value1'] = df['value1'].round()
Details:
In [1]:
import pandas as pd
df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
                   'value1': [1.12, 1.50, 0.10, 3.30, 4.80],
                   'value2': [1.3, 2.5, 0.0, -1.0, -1.0]})
df
Out[1]:
item value1 value2
0 a 1.12 1.3
1 a 1.50 2.5
2 a 0.10 0.0
3 b 3.30 -1.0
4 b 4.80 -1.0
In [2]:
df['value1'] = df['value1'].round()
df
Out[2]:
item value1 value2
0 a 1.0 1.3
1 a 2.0 2.5
2 a 0.0 0.0
3 b 3.0 -1.0
4 b 5.0 -1.0
==== relative percentage difference ====
Code:
% cat rel_pct_diff.py
import pandas as pd
import numpy as np
print('Using pandas', pd.__version__, ', numpy', np.__version__)
df = pd.DataFrame({'old': [2, 1, 0, 5, 0], 'new': [2.1, 1.1, 0.1, 4.9, 0]})
print('original df')
print(df)
df['rel_pct_diff1'] = ((df['new'] / df['old']) - 1) * 100
df['rel_pct_diff2'] = ((df['new'].divide(
    df['old'].where(df['old'] != 0, np.nan))) - 1) * 100
print(df)
Run:
% python ./rel_pct_diff.py
Using pandas 1.1.3 , numpy 1.19.1
original df
old new
0 2 2.1
1 1 1.1
2 0 0.1
3 5 4.9
4 0 0.0
old new rel_pct_diff1 rel_pct_diff2
0 2 2.1 5.0 5.0
1 1 1.1 10.0 10.0
2 0 0.1 inf NaN
3 5 4.9 -2.0 -2.0
4 0 0.0 NaN NaN
==== how to do XYZ ====
* https://stackoverflow.com/questions/51705595/pandas-map-values-from-one-column-to-another-column - shows how to link the previous position of a vehicle with its current position. Solution uses groupby + shift + (bfill or fillna)
==== start column numbers from 1 ====
By default, column numbers start from 0.
% ipython
Python 3.8.3 (default, Jul 2 2020, 16:21:59)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
...: import numpy as np
...: n = np.ones(10).reshape(2,5)
...: print(n)
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
In [2]: df = pd.DataFrame(n)
...: print(df)
0 1 2 3 4
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
To change them to start from 1
In [3]: df.columns = range(1, df.shape[1]+1)
...: print(df)
1 2 3 4 5
0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0
where df.shape[1] gives the number of columns in a dataframe.
In [4]: print(df.shape[1])
5
==== read everything as string ====
pd.read_csv(file_path, dtype=str)
Ref:- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
==== dollars two decimal places ====
Use float_format="%.2f"
df.to_csv(file_path, float_format="%.2f")
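A quick sketch writing to an in-memory buffer instead of a file:
import io
import pandas as pd

df = pd.DataFrame({'price': [1.5, 2.25, 3.14159]})
buf = io.StringIO()
df.to_csv(buf, index=False, float_format="%.2f")
print(buf.getvalue())
# price
# 1.50
# 2.25
# 3.14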
==== number of days between two YYYYMMDD string columns ====
tags | number of days between two dates, convert YYYYMMDD to date
a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
df['days'] = (a_dt - b_dt).dt.days
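For example (made-up dates):
import pandas as pd

df = pd.DataFrame({'a': ['20240115', '20240301'],
                   'b': ['20240101', '20240210']})
a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
df['days'] = (a_dt - b_dt).dt.days
print(df)
#           a         b  days
# 0  20240115  20240101    14
# 1  20240301  20240210    20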
See also:
* https://docs.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-ver15 - similar operation in sql. {tags | sql datediff dd}
==== Select columns from a dataframe only if they exist ====
df[df.columns.intersection(set(['list', 'of', 'cols']))]
Example:
$ ipython
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
df = pd.DataFrame(columns=[1,2,3,4])
df
Out[1]:
Empty DataFrame
Columns: [1, 2, 3, 4]
Index: []
In [2]:
df[df.columns.intersection(set([1, 2, 2, 5]))]
Out[2]:
Empty DataFrame
Columns: [1, 2]
Index: []
In [3]:
pd.__version__
Out[3]:
'1.2.1'
See also:- https://stackoverflow.com/questions/43537166/select-columns-from-dataframe-on-condition-they-exist
==== migrate sql case when else statements to build a new column ====
Use np.select()
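A minimal sketch, assuming a SQL statement like CASE WHEN score >= 90 THEN 'A' WHEN score >= 80 THEN 'B' ELSE 'C' END:
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [95, 85, 70]})
# conditions are evaluated in order; the first match wins, like CASE WHEN
conditions = [df['score'] >= 90, df['score'] >= 80]
choices = ['A', 'B']
df['grade'] = np.select(conditions, choices, default='C')
print(df)
#    score grade
# 0     95     A
# 1     85     B
# 2     70     C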
Ref:-
* https://stackoverflow.com/questions/49228596/pandas-case-when-default-in-pandas - contains an example and other possible approaches
* https://stackoverflow.com/a/57392776
* https://numpy.org/doc/stable/reference/generated/numpy.select.html
* https://www.w3schools.com/sql/sql_case.asp
==== initialize column with strings that range from 1 to N ====
To set it to '1', '2', ..., 'N'
df['foo'] = [str(x) for x in range(1, 1 + df.shape[0])]
To set it to 'A_1', 'A_2', ..., 'A_N'
df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]
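For example:
import pandas as pd

df = pd.DataFrame({'bar': [10, 20, 30]})
df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]
print(df)
#    bar  foo
# 0   10  A_1
# 1   20  A_2
# 2   30  A_3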
==== excel shows alternate blank lines in csv files written using to_csv() ====
Call to_csv() with lineterminator="\n". For example
df.to_csv(path, index=False, lineterminator="\n")
tags | to_csv files have an extra line ending character at the end, to_csv line endings
Ref:- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
==== openpyxl vs. xlrd ====
context | read excel files, pd.read_excel
openpyxl supports newer Excel file formats. xlrd supports old-style Excel files (.xls) [(pandas_read_excel_engine>https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html -> engine)].
xlrd supported .xlsx files at one point, but that support was dropped in 2.0.0 (2020-12-11) [(https://xlrd.readthedocs.io/en/latest/changes.html)].
When calling pd.read_excel(), if engine=None then a waterfall logic is used to determine the engine. This logic is described in https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html .
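To pin the engine explicitly (the file names here are hypothetical, and the corresponding packages must be installed):
import pandas as pd

df_new = pd.read_excel('report.xlsx', engine='openpyxl')  # new-style workbook
df_old = pd.read_excel('legacy.xls', engine='xlrd')       # old-style workbook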
~~REFNOTES~~
Ref:
* https://www.python-excel.org/
* This site contains pointers to the best information available about working with Excel files in the Python programming language.
* https://xlrd.readthedocs.io/en/latest/index.html
===== Missing values =====
==== number of missing values in a dataframe ====
* df.isna().sum().sum() - total number of missing values in the entire dataframe.
* df.isna().sum() - number of missing values per column
* df.isna() - boolean dataframe with True if the value is missing, False otherwise.
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [3, None, 5, 7], 'B': [np.nan, 5, None, 4]})
df
Out[1]:
A B
0 3.0 NaN
1 NaN 5.0
2 5.0 NaN
3 7.0 4.0
In [2]:
df.isna().sum().sum()
Out[2]:
3
In [3]:
df.isna().sum()
Out[3]:
A 1
B 2
dtype: int64
In [4]:
df.isna()
Out[4]:
A B
0 False True
1 True False
2 False True
3 False False
==== fill missing values using a lookup table ====
Q. Fill the NaNs in
title industry
0 Executive Secretary Health
1 Administrative Officer Financial Services
2 Recruiting Manager Property
3 Senior Editor NaN
4 Media Manager I NaN
using the lookup table
title industry
0 Executive Secretary Retail
1 Senior Editor Manufacturing
2 Media Manager I Health
so that we get
title industry
0 Executive Secretary Health
1 Administrative Officer Financial Services
2 Recruiting Manager Property
3 Senior Editor Manufacturing
4 Media Manager I Health
Ans: Get the missing locations using isna() and then use map with set_index.
% ipython
Python 3.8.5 (default, Sep 4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'title': ['Executive Secretary', 'Administrative Officer',
              'Recruiting Manager', 'Senior Editor', 'Media Manager I'],
    'industry': ['Health', 'Financial Services', 'Property', np.nan, np.nan]})
df
Out[1]:
title industry
0 Executive Secretary Health
1 Administrative Officer Financial Services
2 Recruiting Manager Property
3 Senior Editor NaN
4 Media Manager I NaN
In [2]:
lookup = pd.DataFrame({
    'title': ['Executive Secretary', 'Senior Editor', 'Media Manager I'],
    'industry': ['Retail', 'Manufacturing', 'Health']})
lookup
Out[2]:
title industry
0 Executive Secretary Retail
1 Senior Editor Manufacturing
2 Media Manager I Health
In [3]:
missing = df['industry'].isna()
In [4]:
df.loc[missing, 'industry'] = df.loc[missing, 'title'].map(
    lookup.set_index('title')['industry'])
df
Out[4]:
title industry
0 Executive Secretary Health
1 Administrative Officer Financial Services
2 Recruiting Manager Property
3 Senior Editor Manufacturing
4 Media Manager I Health
Ref:- https://stackoverflow.com/questions/64438066/how-can-i-fillna-based-on-the-columns-from-another-dataframe/ - original question. Here, I changed the column names for brevity.
===== Sample usage =====
==== read_csv usage ====
* Read a csv file that has no headers and interpret all columns as strings
pd.read_csv(input_file, dtype=str, sep='|', header=None)
* drop duplicate lines
pd.read_csv(input_file).drop_duplicates()
==== using astype ====
import numpy as np

df = df.astype({'col_a': np.int64,
                'col_b': np.int64})
df['col_a'] = df['col_a'].astype(np.int64)
===== stackoverflow questions I came across =====
==== normalize data within each group ====
* https://stackoverflow.com/questions/46419180/pandas-normalize-within-the-group
==== find many to one mappings ====
* https://stackoverflow.com/questions/24005064/finding-rows-in-a-pandas-dataframe-with-columns-that-violate-a-one-to-one-mappin
==== links ====
===== pivot related =====
==== pivot and sum ====
call pivot_table() with aggfunc=np.sum
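For example (made-up data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})
# on newer pandas, aggfunc='sum' avoids a FutureWarning
table = pd.pivot_table(df, values='val', index='key', aggfunc=np.sum)
print(table)
#      val
# key
# a      3
# b      3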
Ref:-
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html - contains examples
* https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
==== convert pivot table output to dataframe ====
Sample code
table = df[cols_of_interest].pivot_table(values=['amount'], index=['foo', 'bar'],
                                         columns=['class_name'], aggfunc=np.sum)
table.columns = table.columns.droplevel(0)  # drop 'amount' from level 0
table.columns.name = None                   # remove 'class_name' as the column index name
table = table.reset_index()                 # turn the index levels back into columns
Ref:-
* https://stackoverflow.com/questions/43756052/transform-pandas-pivot-table-to-regular-dataframe - where I found the answer
* https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
==== normalize data ====
How to convert
$ cat data.csv
node_id,sim_id,portfolio_risk_total
11,1,10
11,2,20
11,3,30
11,4,40
22,1,100
22,2,200
22,3,300
22,4,400
33,1,1000
33,2,2000
33,3,3000
33,4,4000
to
$ cat out.csv
sim_id,11,22,33
1,10,100,1000
2,20,200,2000
3,30,300,3000
4,40,400,4000
Answer:
The technical term for this is ‘pivoting’. You are converting “stacked” or “record” format data into a normalized form. Below is how to do it. For more complex and related operations, see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html .
In [1]:
import pandas as pd
df = pd.read_csv('data.csv')
df
Out[1]:
node_id sim_id portfolio_risk_total
0 11 1 10
1 11 2 20
2 11 3 30
3 11 4 40
4 22 1 100
5 22 2 200
6 22 3 300
7 22 4 400
8 33 1 1000
9 33 2 2000
10 33 3 3000
11 33 4 4000
In [2]:
table = df.pivot(index='sim_id', columns='node_id', values='portfolio_risk_total')
table
Out[2]:
node_id 11 22 33
sim_id
1 10 100 1000
2 20 200 2000
3 30 300 3000
4 40 400 4000
In [3]:
table.columns.name = None
table = table.reset_index()
table
Out[3]:
sim_id 11 22 33
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000
3 4 40 400 4000
In [4]:
table.to_csv('out.csv', index=False)
===== pandas documentation links I came across =====
==== dummy ====
^ function ^ url ^ notes ^
| pandas.Series.str.split | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html | |
| pandas.Series.str.join | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.join.html#pandas.Series.str.join | |
^ function ^ url ^ notes ^
| pandas.DataFrame.fillna | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html | |
| pandas.DataFrame.replace | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html | |
| pandas.DataFrame.rename | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html | |
| pandas.DataFrame.notna | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html | |
==== user guide ====
^ url ^ title ^ notes ^
| https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html | Reshaping and pivot tables | tags: stacked or record format, denormalized to normalized data |
===== difference between =====
==== What is the difference between read_table and read_csv? ====
^ ^ read_csv ^ read_table ^
| default separator | "," | "\t" |
The code for both read_csv() and read_table() lives in https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/readers.py . The two functions are identical except for the default separator.
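A quick check of the equivalence:
import io
import pandas as pd

data = "a\tb\n1\t2\n3\t4\n"
df1 = pd.read_table(io.StringIO(data))
df2 = pd.read_csv(io.StringIO(data), sep='\t')
print(df1.equals(df2))  # True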
^ ^ read_csv ^ read_table ^
| relevant code |
kwds_defaults = _refine_defaults_read(
    dialect,
    delimiter,
    engine,
    sep,
    on_bad_lines,
    names,
    defaults={"delimiter": ","},
    dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)
|
kwds_defaults = _refine_defaults_read(
    dialect,
    delimiter,
    engine,
    sep,
    on_bad_lines,
    names,
    defaults={"delimiter": "\t"},
    dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)
|
last checked | 2024-07-20