Creating a dataframe
Create a dataframe from a list of lists
>>> import pandas as pd
>>> data = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
>>> df = pd.DataFrame(data)
>>> df
   0  1  2  3
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
You can also specify a list of columns for the dataframe.
>>> columns = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame(data, columns=columns)
>>> df
   a  b  c  d
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
To specify the index
>>> df = pd.DataFrame([[1, 2, 3, 4, 15, 6], ['1', '2', '3', '4', 'F']],
...                   index=['CT', 'NY'])
>>> df
    0  1  2  3   4    5
CT  1  2  3  4  15  6.0
NY  1  2  3  4   F  NaN
To make each list into a column, use zip
% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...:
   ...: # create lists
   ...: l1 = ["Amar", "Barsha", "Carlos", "Tanmay", "Misbah"]
   ...: l2 = ["Alpha", "Bravo", "Charlie", "Tango", "Mike"]
   ...: l3 = [23, 25, 22, 27, 29]
   ...: l4 = [69, 54, 73, 70, 74]
   ...:
   ...: # create the dataframe
   ...: team = pd.DataFrame(list(zip(l1, l2, l3, l4)),
   ...:                     columns=['Name', 'Code', 'Age', 'Weight'])
   ...: print(team)
     Name     Code  Age  Weight
0    Amar    Alpha   23      69
1  Barsha    Bravo   25      54
2  Carlos  Charlie   22      73
3  Tanmay    Tango   27      70
4  Misbah     Mike   29      74
Ref:-
- https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/ - got the idea of using zip from here.
tags | row by row
Create a data frame from two lists
On the interpreter
>>> import pandas as pd
>>> df = pd.DataFrame({'letters': ['a', 'b', 'c', 'd'], 'numbers': [1, 2, 3, 4]})
>>> df
  letters  numbers
0       a        1
1       b        2
2       c        3
3       d        4
In the code
% cat create_df_from_lists.py
import pandas as pd

letters = ['a', 'b', 'c']
words = ['apple', 'ball', 'cat']

df = pd.DataFrame({'letter': letters, 'word': words})

print("letters:\n", letters, "\n")
print("words:\n", words, "\n")
print("df:\n", df, "\n")
Sample run:
% python3 -u create_df_from_lists.py
letters:
 ['a', 'b', 'c']

words:
 ['apple', 'ball', 'cat']

df:
   letter   word
0      a  apple
1      b   ball
2      c    cat
Create a dataframe from a list of tuples
In [1]: import pandas as pd
   ...: data = [(0, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1)]
   ...: df = pd.DataFrame(data)
   ...: df
Out[1]:
   0  1  2  3
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
You can also specify a list of columns for the dataframe.
In [2]: columns = ('a', 'b', 'c', 'd')
   ...: df = pd.DataFrame(data, columns=columns)
   ...: df
Out[2]:
   a  b  c  d
0  0  1  0  1
1  1  0  1  1
2  0  1  1  1
Create a dataframe using StringIO
In [1]: import pandas as pd
   ...: import io
   ...: data = """
   ...: Date,Open,High,Low,Close,Adj Close,Volume
   ...: 2024-01-02,472.16,473.67,470.49,472.65,469.67,123623700
   ...: 2024-01-03,470.43,471.19,468.17,468.79,465.84,103585900
   ...: 2024-01-04,468.3,470.96,467.05,467.28,464.33,84232200
   ...: 2024-01-05,467.49,470.44,466.43,467.92,464.97,86060800
   ...: 2024-01-08,468.43,474.75,468.3,474.6,471.61,74879100
   ...: """
   ...: df = pd.read_csv(io.StringIO(data))

In [2]: df
Out[2]:
         Date    Open    High     Low   Close  Adj Close     Volume
0  2024-01-02  472.16  473.67  470.49  472.65     469.67  123623700
1  2024-01-03  470.43  471.19  468.17  468.79     465.84  103585900
2  2024-01-04  468.30  470.96  467.05  467.28     464.33   84232200
3  2024-01-05  467.49  470.44  466.43  467.92     464.97   86060800
4  2024-01-08  468.43  474.75  468.30  474.60     471.61   74879100
Tested with Python 3.12.3, IPython 8.25.0
Create a dataframe by splitting strings
Given a list of strings, the idea here is to create a data frame by splitting them into multiple columns.
Load the data into a pandas series
% ipython
Python 3.10.6 (main, Oct 24 2022, 16:07:47) [GCC 11.2.0]
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: strings = ['Netflix 100', 'Costco 200', 'Walmart 500', 'Costco 500']
   ...: s = pd.Series(strings)
   ...: print(s)
0    Netflix 100
1     Costco 200
2    Walmart 500
3     Costco 500
dtype: object
Create a dataframe by splitting the strings into multiple columns
In [2]: df = s.str.split(expand=True)
   ...: print(df)
         0    1
0  Netflix  100
1   Costco  200
2  Walmart  500
3   Costco  500
Name the columns.
str.split gives everything as strings. Change them to numbers as needed.
In [3]: df.columns = ['company', 'value']
   ...: df['value'] = df['value'].astype('float')
   ...: print(df)
   company  value
0  Netflix  100.0
1   Costco  200.0
2  Walmart  500.0
3   Costco  500.0
See also:
- I used this trick to answer a question on stackoverflow - https://stackoverflow.com/a/75212176/6305733
also demonstrates | assign column names to a dataframe
Create a dataframe from a series of lists
tags | convert series with lists to dataframe
df = pd.DataFrame(s.to_list())
For example
In [1]: import pandas as pd
   ...: s = pd.Series([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
   ...: s
Out[1]:
0       [1, 2, 3]
1    [4, 5, 6, 7]
2          [8, 9]
dtype: object

In [2]: df = pd.DataFrame(s.to_list())
   ...: df
Out[2]:
   0  1    2    3
0  1  2  3.0  NaN
1  4  5  6.0  7.0
2  8  9  NaN  NaN
If each list has the same number of elements, np.vstack() can also be used; otherwise it will not work. For example
In [5]: s
Out[5]:
0       [1, 2, 3]
1    [4, 5, 6, 7]
2          [8, 9]
dtype: object

In [6]: import numpy as np
   ...: df = pd.DataFrame(np.vstack(s))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 2
      1 import numpy as np
----> 2 df = pd.DataFrame(np.vstack(s))

File <__array_function__ internals>:200, in vstack(*args, **kwargs)

File ~\AppData\Local\conda\conda\envs\py311\Lib\site-packages\numpy\core\shape_base.py:296, in vstack(tup, dtype, casting)
    294 if not isinstance(arrs, list):
    295     arrs = [arrs]
--> 296 return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)

File <__array_function__ internals>:200, in concatenate(*args, **kwargs)

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 3 and the array at index 1 has size 4
But with lists of equal length, both approaches work:
In [10]: s = pd.Series([[1, 2, 3], [4, 5, 6]])
    ...: s
Out[10]:
0    [1, 2, 3]
1    [4, 5, 6]
dtype: object

In [11]: import numpy as np
    ...: df = pd.DataFrame(np.vstack(s))
    ...: df
Out[11]:
   0  1  2
0  1  2  3
1  4  5  6

In [12]: df = pd.DataFrame(s.to_list())
    ...: df
Out[12]:
   0  1  2
0  1  2  3
1  4  5  6
Create a dataframe from a bunch of variables
import pandas as pd

df = pd.DataFrame({
    'key': ['var1', 'var2', 'var3'],
    'value': [var1, var2, var3]
})
For example
$ ipython

In [1]: year = 2023; month = 6; date = 15

In [2]: import pandas as pd
   ...: df = pd.DataFrame({
   ...:     'key': ['year', 'month', 'date'],
   ...:     'value': [year, month, date]
   ...: })

In [3]: df
Out[3]:
     key  value
0   year   2023
1  month      6
2   date     15

In [4]: df.dtypes
Out[4]:
key      object
value     int64
dtype: object
It works even if the variables are not of the same type.
In [5]: year = 2023; month = 'June'; date = 15

In [6]: df = pd.DataFrame({
   ...:     'key': ['year', 'month', 'date'],
   ...:     'value': [year, month, date]
   ...: })

In [7]: df
Out[7]:
     key value
0   year  2023
1  month  June
2   date    15

In [8]: df.dtypes
Out[8]:
key      object
value    object
dtype: object
Tested with Python 3.11.3, IPython 8.12.0
selection related
split columns
description | split a column into two columns based on whether the values are positive or negative, default value, longs and shorts
tags | uses [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html pandas.Series.str.split]
lookup value
To pick the first value in column 'A' for rows where column B is FOO
df.loc[df['B'] == 'FOO', 'A'].iloc[0]
Example:
$ ipython

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'A': ['p1', 'p2', 'p3', 'p4'], 'B': [1, 3, 3, 2]})
   ...: print(df)
    A  B
0  p1  1
1  p2  3
2  p3  3
3  p4  2

In [2]: df.loc[df['B'] == 3, 'A']
Out[2]:
1    p2
2    p3
Name: A, dtype: object

In [3]: df.loc[df['B'] == 3, 'A'].iloc[0]
Out[3]: 'p2'
search tags | value of one column when another column equals something
Ref:- https://stackoverflow.com/questions/36684013/extract-column-value-based-on-another-column-in-pandas
Series related
view the index, type, length and name of the series
s.index
s.dtype
s.size
s.name
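For example (a made-up series to illustrate):

In [1]: import pandas as pd
   ...: s = pd.Series([10, 20, 30], name='prices')

In [2]: s.index
Out[2]: RangeIndex(start=0, stop=3, step=1)

In [3]: s.dtype
Out[3]: dtype('int64')

In [4]: s.size
Out[4]: 3

In [5]: s.name
Out[5]: 'prices'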
unique types found in an object series
First, use the .apply method with the type function to get back a Series that has the type of every member. Then chain the .unique method onto the result.
s.apply(type).unique()
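For example (a made-up mixed-type series):

In [1]: import pandas as pd
   ...: s = pd.Series(['foo', 1, 2.5])
   ...: s.apply(type).unique()
Out[1]: array([<class 'str'>, <class 'int'>, <class 'float'>], dtype=object)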
a vs b
shape vs size
Shape gives the number of elements in each dimension. Size gives the total number of elements.
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

In [2]: df
Out[2]:
   col1  col2  col3
0     1     3     5
1     2     4     6

In [3]: df.shape
Out[3]: (2, 3)

In [4]: df.size
Out[4]: 6
using | pandas 1.1.2, python 3.8.18
Missing data
stringify nans
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a
0    foo
1    NaN
2    bar
3    NaN
dtype: object
>>> a.fillna('baz')
0    foo
1    baz
2    bar
3    baz
dtype: object
>>> a
0    foo
1    NaN
2    bar
3    NaN
dtype: object
replace a string with NaN
% python3
Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> a = pd.Series(['foo', np.nan, 'bar', np.nan])
>>> a.isnull()
0    False
1     True
2    False
3     True
dtype: bool
>>> b = a.fillna('baz')
>>> b
0    foo
1    baz
2    bar
3    baz
dtype: object
>>> c = b.replace('baz', np.nan, regex=True)
>>> c
0    foo
1    NaN
2    bar
3    NaN
dtype: object
>>> c.isnull()
0    False
1     True
2    False
3     True
dtype: bool
demonstrates | how to change a string to NaN
See also:- https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas
use values in another column as a fallback
$ ipython

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'old': ['m', 'n', 'o', 'p'],
   ...:                    'new': ['a', None, 'b', None]})
   ...: df
Out[1]:
  old   new
0   m     a
1   n  None
2   o     b
3   p  None

In [2]: df['new'] = df['new'].fillna(df['old'])
   ...: df
Out[2]:
  old new
0   m   a
1   n   n
2   o   b
3   p   p
using | Python 3.11.8, pandas 2.1.4
Tasks
- tags | max_rows, max_columns, set_option, all rows, all columns
- tags | max_colwidth, set_option
- tags | pretty print dataframe
read_csv()
read all columns as strings
# read_csv preserves the white space. So strip it out.
df = pd.read_csv(file_name, dtype='str')\
       .rename(columns=lambda x: x.strip())\
       .apply(lambda x: x.str.strip())
[{demonstrates: strip spaces in the column names, strip all elements},
{tags: read_csv as all strings}]
check if
check if some columns exist in a dataframe
approach 1:
if set(['A', 'C']).issubset(df.columns):
    df['sum'] = df['A'] + df['C']
approach 2:
Instead of set([]), you can also use {} to create sets.
if {'A', 'C'}.issubset(df.columns):
    ...
approach 3:
You can also do this using a generator expression
if all(item in df.columns for item in ['A', 'C']):
    ...
Ref:
- https://stackoverflow.com/a/39371897 - where I found the answer
- talks about creating sets using {}
data transformation related
matches to coplays
Consider the dataframe
   match_id player_id
0         0         a
1         0         b
2         0         c
3         1         a
4         1         b
5         2         c
which shows the players who played in each match. For example, [a, b, c] played in match 0, [a, b] played in match 1, and c alone played in match 2.
We want to get a coplay count
  player_id1 player_id2  size
0          a          a     2
1          a          b     2
2          a          c     1
3          b          a     2
4          b          b     2
5          b          c     1
6          c          a     1
7          c          b     1
8          c          c     2
which shows
- the number of games each player played with another player when player_id1 $\neq$ player_id2 (e.g., a and b played in 2 games, a and c played in 1 game)
- the number of games a player played when player_id1 $=$ player_id2
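One way to compute this count is to self-merge on match_id, so that every player is paired with every co-player in the same match, and then count rows per pair. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'match_id': [0, 0, 0, 1, 1, 2],
                   'player_id': ['a', 'b', 'c', 'a', 'b', 'c']})

# Pair every player with every player in the same match (including themselves).
pairs = df.merge(df, on='match_id', suffixes=('1', '2'))

# Count the number of matches per (player_id1, player_id2) pair.
coplay = (pairs.groupby(['player_id1', 'player_id2'])
               .size()
               .reset_index(name='size'))
print(coplay)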
difference between
convert stuff
convert column names to lower case and replace space with underscore
df.columns = df.columns.str.lower().str.replace(' ', '_')
Example:
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['2023-03-02', '2023-02-28', 3.20, 3.30],
   ...:      ['2022-12-08', '2022-11-30', 3.14, 3.10]],
   ...:     columns=["Announcement Date", "Fiscal Quarter End",
   ...:              "Estimated EPS", "Actual EPS"])
   ...: df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]: df.columns = df.columns.str.lower().str.replace(' ', '_')

In [3]: df
Out[3]:
  announcement_date fiscal_quarter_end  estimated_eps  actual_eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1
convert column names to lower case
Approach 1:
df.columns = df.columns.str.lower()
Approach 2:
df.rename(columns=lambda x: x.lower(), inplace=True)
Notes:
- I prefer approach 1 - simple syntax, easy to remember.
Use case: While merging data from two data frames using DataFrame.merge(), I ended up with two columns with the same name but differing in case (ex: foo from df1, FOO from df2). This caused problems when I tried to upload the data into a Hadoop cluster since Hive is not case sensitive. As a workaround, I converted the column names in df2 to lower case and then merged using pd.merge(df1, df2, …, suffixes=('_df1', '_df2')). The resulting data frame will then have foo_df1 and foo_df2 columns.
Example (using approach 1):
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['2023-03-02', '2023-02-28', 3.20, 3.30],
   ...:      ['2022-12-08', '2022-11-30', 3.14, 3.10]],
   ...:     columns=["Announcement Date", "Fiscal Quarter End",
   ...:              "Estimated EPS", "Actual EPS"])
   ...: df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]: df.columns = df.columns.str.lower()

In [3]: df
Out[3]:
  announcement date fiscal quarter end  estimated eps  actual eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1
Example (using approach 2):
$ ipython
Python 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
IPython 8.8.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['2023-03-02', '2023-02-28', 3.20, 3.30],
   ...:      ['2022-12-08', '2022-11-30', 3.14, 3.10]],
   ...:     columns=["Announcement Date", "Fiscal Quarter End",
   ...:              "Estimated EPS", "Actual EPS"])
   ...: df
Out[1]:
  Announcement Date Fiscal Quarter End  Estimated EPS  Actual EPS
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1

In [2]: df.rename(columns=lambda x: x.lower(), inplace=True)

In [3]: df
Out[3]:
  announcement date fiscal quarter end  estimated eps  actual eps
0        2023-03-02         2023-02-28           3.20         3.3
1        2022-12-08         2022-11-30           3.14         3.1
convert all categorical columns to string columns
df.loc[:, df.dtypes == "category"] = df.select_dtypes(
    ["category"]).apply(lambda x: x.astype("object"))
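For example, here is a minimal sketch using an equivalent, more explicit variant of the same idea (the frame is made up):

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': pd.Categorical(['x', 'y']), 'b': [1, 2]})
   ...: df.dtypes
Out[1]:
a    category
b       int64
dtype: object

In [2]: cat_cols = df.select_dtypes(['category']).columns
   ...: df[cat_cols] = df[cat_cols].astype('object')
   ...: df.dtypes
Out[2]:
a    object
b     int64
dtype: object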
convert all columns in a dataframe to string
df.astype(str)
Example:
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': [648, 435], 'b': [175, 389]})
   ...: print(df)
     a    b
0  648  175
1  435  389

In [2]: df.dtypes
Out[2]:
a    int64
b    int64
dtype: object

In [3]: df2 = df.astype(str)
   ...: df2.dtypes
Out[3]:
a    object
b    object
dtype: object

In [4]: print(df2)
     a    b
0  648  175
1  435  389
convert two columns of dataframe into a dictionary
df.set_index('id')['value'].to_dict()
Example
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'id': [0, 1, 2], 'value': [1.7, 5.2, 4.4]})
   ...: df
Out[1]:
   id  value
0   0    1.7
1   1    5.2
2   2    4.4

In [2]: df.set_index('id')['value'].to_dict()
Out[2]: {0: 1.7, 1: 5.2, 2: 4.4}
Ref:- https://stackoverflow.com/questions/18695605/python-pandas-dataframe-to-dictionary
convert % to numbers
tags | convert percent symbol in strings to numbers
In [2]: import pandas as pd
   ...: df = pd.DataFrame({'foo': ['1.8%', '-2.5%', '0.7%', '3.2%']})
   ...: df
Out[2]:
     foo
0   1.8%
1  -2.5%
2   0.7%
3   3.2%

In [3]: df['foo'] = df['foo'].str.rstrip('%').astype(float)/100

In [4]: df
Out[4]:
     foo
0  0.018
1 -0.025
2  0.007
3  0.032
Tested using python 3.10.9, ipython 8.8.0, and pandas 1.5.2
Ref:- https://stackoverflow.com/questions/25669588/convert-percent-string-to-float-in-pandas-read-csv
dummy
json to dataframe
From | To | Use |
---|---|---|
JSON | DataFrame | pandas.read_json() |
DataFrame | JSON | DataFrame.to_json() |
JSON | Python object | json.loads() |
Python object | JSON | json.dumps() |
Sample code nbviewer/github/KamarajuKusumanchi
sum if columns exist
tags | sum if column exists
$ ipython

In [1]: import numpy as np
   ...: import pandas as pd
   ...: df = pd.DataFrame({'ID1': [15.3, 12.1, 13.2, 10.0, np.nan],
   ...:                    'ID2': [7.0, 7.7, np.nan, 11.3, np.nan],
   ...:                    'ID5': [10, 15, 3.1, 2.2, np.nan]})
   ...: df
Out[1]:
    ID1   ID2   ID5
0  15.3   7.0  10.0
1  12.1   7.7  15.0
2  13.2   NaN   3.1
3  10.0  11.3   2.2
4   NaN   NaN   NaN

In [2]: List_ID = ['ID1', 'ID2', 'ID3']
   ...: cols = df.columns[df.columns.isin(List_ID)]
   ...: cols
Out[2]: Index(['ID1', 'ID2'], dtype='object')

In [3]: res = df[cols].sum(axis=1)
   ...: res
Out[3]:
0    22.3
1    19.8
2    13.2
3    21.3
4     0.0
dtype: float64
Using Python 3.9.4, IPython 7.22.0, pandas 1.2.4, numpy 1.20.1
Ref:- https://stackoverflow.com/questions/38700848/adding-columns-if-they-exist-in-the-dataframe-pandas
get unique values in a series as another series
- Series.unique() returns a numpy.ndarray
- Series.drop_duplicates() returns a Series
- Series.drop_duplicates(keep='first') retains the first occurrence of any duplicates, keep='last' retains the last occurrence, and keep=False retains NONE of the duplicates. keep='first' is the default.
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': [1, 2, 3, 3],
   ...:                    'b': ['foo', 'foo', 'bar', 'bar']})
   ...: df
Out[1]:
   a    b
0  1  foo
1  2  foo
2  3  bar
3  3  bar

In [2]: df['a'].unique()
Out[2]: array([1, 2, 3], dtype=int64)

In [3]: type(df['a'].unique())
Out[3]: numpy.ndarray

In [4]: df['a'].drop_duplicates()
Out[4]:
0    1
1    2
2    3
Name: a, dtype: int64

In [5]: type(df['a'].drop_duplicates())
Out[5]: pandas.core.series.Series

In [6]: df['a'].drop_duplicates(keep='last')
Out[6]:
0    1
1    2
3    3
Name: a, dtype: int64

In [7]: type(df['a'].drop_duplicates(keep='last'))
Out[7]: pandas.core.series.Series
Ref:- https://github.com/pandas-dev/pandas/issues/1923#issuecomment-398217427
clean up column names
# Strip out the white space from both ends of the column names
df.columns = df.columns.str.strip()
# Remove white space from both ends of the column names
# -> convert them to lower case
# -> replace space with an underscore
# -> remove open and close parenthesis
# (regex=False so that '(' and ')' are treated literally, not as regex)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')\
    .str.replace('(', '', regex=False).str.replace(')', '', regex=False)
round a single column
How to round a single column in pandas without affecting other columns? For example, given
df:
  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0
How to get
  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0
Solution:
df['value1'] = df['value1'].round()
Details:
In [1]: import pandas as pd
   ...: df = pd.DataFrame({'item': ['a', 'a', 'a', 'b', 'b'],
   ...:                    'value1': [1.12, 1.50, 0.10, 3.30, 4.80],
   ...:                    'value2': [1.3, 2.5, 0.0, -1.0, -1.0]})
   ...: df
Out[1]:
  item  value1  value2
0    a    1.12     1.3
1    a    1.50     2.5
2    a    0.10     0.0
3    b    3.30    -1.0
4    b    4.80    -1.0

In [2]: df['value1'] = df['value1'].round()
   ...: df
Out[2]:
  item  value1  value2
0    a     1.0     1.3
1    a     2.0     2.5
2    a     0.0     0.0
3    b     3.0    -1.0
4    b     5.0    -1.0
relative percentage difference
Code:
% cat rel_pct_diff.py
import pandas as pd
import numpy as np

print('Using pandas', pd.__version__, ', numpy', np.__version__)

df = pd.DataFrame({'old': [2, 1, 0, 5, 0],
                   'new': [2.1, 1.1, 0.1, 4.9, 0]})
print('original df')
print(df)

df['rel_pct_diff1'] = ((df['new'] / df['old']) - 1) * 100
df['rel_pct_diff2'] = ((df['new'].divide(
    df['old'].where(df['old'] != 0, np.nan))) - 1) * 100
print(df)
Run:
% python ./rel_pct_diff.py
Using pandas 1.1.3 , numpy 1.19.1
original df
   old  new
0    2  2.1
1    1  1.1
2    0  0.1
3    5  4.9
4    0  0.0
   old  new  rel_pct_diff1  rel_pct_diff2
0    2  2.1            5.0            5.0
1    1  1.1           10.0           10.0
2    0  0.1            inf            NaN
3    5  4.9           -2.0           -2.0
4    0  0.0            NaN            NaN
how to do XYZ
- https://stackoverflow.com/questions/51705595/pandas-map-values-from-one-column-to-another-column - shows how to link the previous position of a vehicle with its current position. Solution uses groupby + shift + (bfill or fillna); a sketch of the idea follows.
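A sketch of that groupby + shift + fillna idea (column names and data are made up here):

import pandas as pd

# Made-up data: each row is a position reported by a vehicle, in time order.
df = pd.DataFrame({'vehicle': ['v1', 'v1', 'v2', 'v1', 'v2'],
                   'position': [10, 15, 7, 22, 9]})

# Previous position of the same vehicle; NaN for its first appearance.
df['prev_position'] = df.groupby('vehicle')['position'].shift()

# Fall back to the current position where there is no previous one.
df['prev_position'] = df['prev_position'].fillna(df['position'])
print(df)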
start column numbers from 1
By default, column numbers start from 0.
% ipython
Python 3.8.3 (default, Jul  2 2020, 16:21:59)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: n = np.ones(10).reshape(2, 5)
   ...: print(n)
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

In [2]: df = pd.DataFrame(n)
   ...: print(df)
     0    1    2    3    4
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0
To change them to start from 1
In [3]: df.columns = range(1, df.shape[1]+1)
   ...: print(df)
     1    2    3    4    5
0  1.0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0  1.0
where df.shape[1] gives the number of columns in a dataframe.
In [4]: print(df.shape[1])
5
read everything as string
pd.read_csv(file_path, dtype=str)
Ref:- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
dollars two decimal places
Use float_format="%.2f"
df.to_csv(file_path, float_format="%.2f")
number of days between two YYYYMMDD string columns
tags | number of days between two dates, convert YYYYMMDD to date
a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
df['days'] = (a_dt - b_dt).dt.days
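For example (made-up dates):

In [1]: import pandas as pd
   ...: df = pd.DataFrame({'a': ['20240131', '20240301'],
   ...:                    'b': ['20240101', '20240201']})
   ...: a_dt = pd.to_datetime(df['a'], format='%Y%m%d')
   ...: b_dt = pd.to_datetime(df['b'], format='%Y%m%d')
   ...: df['days'] = (a_dt - b_dt).dt.days
   ...: df
Out[1]:
          a         b  days
0  20240131  20240101    30
1  20240301  20240201    29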
See also:
- https://docs.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-ver15 - similar operation in sql. {tags | sql datediff dd}
Select columns from a dataframe only if they exist
df[df.columns.intersection(set(['list', 'of', 'cols']))]
Example:
$ ipython
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(columns=[1, 2, 3, 4])
   ...: df
Out[1]:
Empty DataFrame
Columns: [1, 2, 3, 4]
Index: []

In [2]: df[df.columns.intersection(set([1, 2, 2, 5]))]
Out[2]:
Empty DataFrame
Columns: [1, 2]
Index: []

In [3]: pd.__version__
Out[3]: '1.2.1'
See also:- https://stackoverflow.com/questions/43537166/select-columns-from-dataframe-on-condition-they-exist
migrate sql case when else statements to build a new column
Use np.select()
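For illustration, a minimal sketch (the column name and cutoffs are made up): np.select() checks the conditions in order, like the WHEN branches, and default plays the role of ELSE.

import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [95, 72, 55, 81]})

# SQL: CASE WHEN score >= 90 THEN 'A' WHEN score >= 70 THEN 'B' ELSE 'C' END
conditions = [df['score'] >= 90, df['score'] >= 70]
choices = ['A', 'B']
df['grade'] = np.select(conditions, choices, default='C')
print(df)
#    score grade
# 0     95     A
# 1     72     B
# 2     55     C
# 3     81     B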
Ref:-
- https://stackoverflow.com/questions/49228596/pandas-case-when-default-in-pandas - contains an example and other possible approaches
initialize column with strings that range from 1 to N
To set it to '1', '2', …, 'N'
df['foo'] = [str(x) for x in range(1, 1 + df.shape[0])]
To set it to 'A_1', 'A_2', …, 'A_N'
df['foo'] = ['A_' + str(x) for x in range(1, 1 + df.shape[0])]
excel shows alternate blank lines in csv files written using to_csv()
Call to_csv() with lineterminator="\n". For example
df.to_csv(path, index=False, lineterminator="\n")
tags | to_csv files have an extra line ending character at the end, to_csv line endings
Ref:- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
openpyxl vs. xlrd
context | read excel files, pd.read_excel
openpyxl supports newer Excel file formats. xlrd supports old-style Excel files (.xls).
xlrd used to support .xlsx files at some point, but dropped that support in 2.0.0 (2020-12-11).
When calling pd.read_excel(), if engine=None then a waterfall logic is used to determine the engine. This logic is described in https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html .
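To bypass the waterfall logic, the engine can also be pinned explicitly. A minimal sketch (file names are hypothetical):

import pandas as pd

df_new = pd.read_excel('data.xlsx', engine='openpyxl')  # newer .xlsx format
df_old = pd.read_excel('data.xls', engine='xlrd')       # old-style .xls format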
Ref:
- This site contains pointers to the best information available about working with Excel files in the Python programming language.
Missing values
number of missing values in a dataframe
- df.isna().sum().sum() - total number of missing values in the entire dataframe.
- df.isna().sum() - number of missing values per column
- df.isna() - boolean dataframe with True if the value is missing, False otherwise.
In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({'A': [3, None, 5, 7], 'B': [np.nan, 5, None, 4]})
   ...: df
Out[1]:
     A    B
0  3.0  NaN
1  NaN  5.0
2  5.0  NaN
3  7.0  4.0

In [2]: df.isna().sum().sum()
Out[2]: 3

In [3]: df.isna().sum()
Out[3]:
A    1
B    2
dtype: int64

In [4]: df.isna()
Out[4]:
       A      B
0  False   True
1   True  False
2  False   True
3  False  False
fill missing values using a lookup table
Q. Fill the NaNs in
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN
using the lookup table
                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health
so that we get
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health
Ans: Get the missing locations using isna() and then use map with set_index.
% ipython
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: df = pd.DataFrame({
   ...:     'title': ['Executive Secretary', 'Administrative Officer',
   ...:               'Recruiting Manager', 'Senior Editor', 'Media Manager I'],
   ...:     'industry': ['Health', 'Financial Services', 'Property',
   ...:                  np.nan, np.nan]})
   ...: df
Out[1]:
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor                 NaN
4         Media Manager I                 NaN

In [2]: lookup = pd.DataFrame({
   ...:     'title': ['Executive Secretary', 'Senior Editor', 'Media Manager I'],
   ...:     'industry': ['Retail', 'Manufacturing', 'Health']})
   ...: lookup
Out[2]:
                 title       industry
0  Executive Secretary         Retail
1        Senior Editor  Manufacturing
2      Media Manager I         Health

In [3]: missing = df['industry'].isna()

In [4]: df.loc[missing, 'industry'] = df.loc[missing, 'title'].map(
   ...:     lookup.set_index('title')['industry'])
   ...: df
Out[4]:
                    title            industry
0     Executive Secretary              Health
1  Administrative Officer  Financial Services
2      Recruiting Manager            Property
3           Senior Editor       Manufacturing
4         Media Manager I              Health
Ref:- https://stackoverflow.com/questions/64438066/how-can-i-fillna-based-on-the-columns-from-another-dataframe/ - original question. Here, I changed the column names for brevity.
Sample usage
read_csv usage
- Read a csv file that has no headers and interpret all columns as strings
pd.read_csv(input_file, dtype=str, sep='|', header=None)
- drop duplicate lines
pd.read_csv(input_file).drop_duplicates()
using astype
df = df.astype({'col_a': np.int64, 'col_b': np.int64})
df['col_a'] = df['col_a'].astype(np.int64)
stackoverflow questions I came across
normalize data within each group
find many to one mappings
links
<place holder>
pivot related
pivot and sum
call pivot_table() with aggfunc=np.sum
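For example (a made-up frame):

In [1]: import numpy as np
   ...: import pandas as pd
   ...: df = pd.DataFrame({'group': ['x', 'x', 'x', 'y'],
   ...:                    'kind': ['a', 'b', 'a', 'b'],
   ...:                    'amount': [1, 2, 3, 4]})
   ...: df.pivot_table(values='amount', index='group',
   ...:                columns='kind', aggfunc=np.sum)
Out[1]:
kind     a    b
group
x      4.0  2.0
y      NaN  4.0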
convert pivot table output to dataframe
Sample code
table = df[cols_of_interest].pivot_table(
    values=['amount'], index=['foo', 'bar'],
    columns=['class_name'], aggfunc=np.sum)
table.columns = table.columns.droplevel(0)  # remove 'amount' from level 0 of the columns
table.columns.name = None                   # remove the columns index name
table = table.reset_index()                 # change the index back to columns
Ref:-
- https://stackoverflow.com/questions/43756052/transform-pandas-pivot-table-to-regular-dataframe - where I found the answer
normalize data
How to convert
$ cat data.csv
node_id,sim_id,portfolio_risk_total
11,1,10
11,2,20
11,3,30
11,4,40
22,1,100
22,2,200
22,3,300
22,4,400
33,1,1000
33,2,2000
33,3,3000
33,4,4000
to
$ cat out.csv
sim_id,11,22,33
1,10,100,1000
2,20,200,2000
3,30,300,3000
4,40,400,4000
Answer:
The technical term for this is ‘pivoting’. You are converting “stacked” or “record” format data into a normalized form. Below is how to do it. For more complex and related operations, see https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html .
In [1]: import pandas as pd
   ...: df = pd.read_csv('data.csv')
   ...: df
Out[1]:
    node_id  sim_id  portfolio_risk_total
0        11       1                    10
1        11       2                    20
2        11       3                    30
3        11       4                    40
4        22       1                   100
5        22       2                   200
6        22       3                   300
7        22       4                   400
8        33       1                  1000
9        33       2                  2000
10       33       3                  3000
11       33       4                  4000

In [2]: table = df.pivot(index='sim_id', columns='node_id',
   ...:                  values='portfolio_risk_total')
   ...: table
Out[2]:
node_id  11   22    33
sim_id
1        10  100  1000
2        20  200  2000
3        30  300  3000
4        40  400  4000

In [3]: table.columns.name = None
   ...: table = table.reset_index()
   ...: table
Out[3]:
   sim_id  11   22    33
0       1  10  100  1000
1       2  20  200  2000
2       3  30  300  3000
3       4  40  400  4000

In [4]: table.to_csv('out.csv', index=False)
pandas documentation links I came across
dummy
function | url | notes |
---|---|---|
pandas.Series.str.split | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html | |
pandas.Series.str.join | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.join.html#pandas.Series.str.join | |
function | url | notes |
---|---|---|
pandas.DataFrame.fillna | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html | |
pandas.DataFrame.replace | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html | |
pandas.DataFrame.rename | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html | |
pandas.DataFrame.notna | https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html | |
user guide
url | title | notes |
---|---|---|
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html | Reshaping and pivot tables | tags: stacked or record format, denormalized to normalized data |
difference between
What is the difference between read_table and read_csv?
 | read_csv | read_table |
---|---|---|
default separator | "," | "\t" |
The code for read_csv() and read_table() is in https://github.com/pandas-dev/pandas/blob/main/pandas/io/parsers/readers.py . The two functions are identical except for the default separator.
Both functions call _refine_defaults_read() with identical arguments; only the defaults differ.

read_csv:

kwds_defaults = _refine_defaults_read(
    dialect,
    delimiter,
    engine,
    sep,
    on_bad_lines,
    names,
    defaults={"delimiter": ","},
    dtype_backend=dtype_backend,
)
kwds.update(kwds_defaults)

read_table: the same code, but with defaults={"delimiter": "\t"}.
last checked | 2024-07-20