Skip to content

bobby_dreamer

Python tips

python, pandas1 min read

Last updated : 08/August/2020

Events are in reverse chronological order.



Drop duplicates, but ignore nulls
1# Creating a sample dataframe
2df = pd.DataFrame()
3names = ['Gandolf', 'Gandolf','Gimli','Frodo','Legolas','Bilbo', 'Aragorn', 'Arwen']
4types = ['Wizard', 'Wizard','Dwarf','Hobbit','Elf','Hobbit', np.nan, np.nan]
5magic = [0, 10, 1, 4, 6, 4, np.nan, np.nan]
6aggression = [0, 7, 10, 2, 5, 1, np.nan, np.nan]
7stealth = [0, 8, 2, 5, 10, 5, np.nan, np.nan]
8df['names'], df['type'], df['magic_power'] = names, types, magic
9df['aggression'], df['stealth'] = aggression, stealth
10
11df
12+----+---------+--------+---------------+--------------+-----------+
13| | names | type | magic_power | aggression | stealth |
14|----+---------+--------+---------------+--------------+-----------|
15| 0 | Gandolf | Wizard | 0 | 0 | 0 |
16| 1 | Gandolf | Wizard | 10 | 7 | 8 |
17| 2 | Gimli | Dwarf | 1 | 10 | 2 |
18| 3 | Frodo | Hobbit | 4 | 2 | 5 |
19| 4 | Legolas | Elf | 6 | 5 | 10 |
20| 5 | Bilbo | Hobbit | 4 | 1 | 5 |
21| 6 | Aragorn | nan | nan | nan | nan |
22| 7 | Arwen | nan | nan | nan | nan |
23+----+---------+--------+---------------+--------------+-----------+

Here we have two sets of duplicates and as per the command we are keeping the last and eliminating the first.

1df1 = df[(~df['type'].duplicated(keep='last'))]
2+----+---------+--------+---------------+--------------+-----------+
3| | names | type | magic_power | aggression | stealth |
4|----+---------+--------+---------------+--------------+-----------|
5| 1 | Gandolf | Wizard | 10 | 7 | 8 |
6| 2 | Gimli | Dwarf | 1 | 10 | 2 |
7| 4 | Legolas | Elf | 6 | 5 | 10 |
8| 5 | Bilbo | Hobbit | 4 | 1 | 5 |
9| 7 | Arwen | nan | nan | nan | nan |
10+----+---------+--------+---------------+--------------+-----------+

General idea is NaNs are suppose to be incomparable as it basically means 'no value'. So, comparing no values and keeping the last does not make sense. So, we do this and here we are maintaining the uniqueness and keeping nulls

1df1 = df[(~df['type'].duplicated(keep='last')) | df['type'].isna()]
2+----+---------+--------+---------------+--------------+-----------+
3| | names | type | magic_power | aggression | stealth |
4|----+---------+--------+---------------+--------------+-----------|
5| 1 | Gandolf | Wizard | 10 | 7 | 8 |
6| 2 | Gimli | Dwarf | 1 | 10 | 2 |
7| 4 | Legolas | Elf | 6 | 5 | 10 |
8| 5 | Bilbo | Hobbit | 4 | 1 | 5 |
9| 6 | Aragorn | nan | nan | nan | nan |
10| 7 | Arwen | nan | nan | nan | nan |
11+----+---------+--------+---------------+--------------+-----------+
Convert two arrays to dictionary
1keys = ['a', 'b', 'c']
2values = [1, 2, 3]
3dictionary = dict(zip(keys, values))
4print(dictionary)
5
6# output
7{'b': 2, 'a': 1, 'c': 3}

View specific row by index
1df_joined.loc[[28]]
2df_joined.iloc[[27]]

Read all columns into pandas as object
1df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)

Common pandas options
1pd.set_option('display.max_columns', None)
2pd.set_option('display.max_colwidth', 200)
3
4# To display all rows
5pd.set_option('display.max_rows', None)
6
7# Setting limit to display max 100 rows
8pd.set_option('display.max_rows', 100)
9
10# When there are lots of columns (by default 100+) then df.info() wouldn't show all the columns,
11<class 'pandas.core.frame.DataFrame'>
12RangeIndex: 3984 entries, 0 to 3983
13Columns: 114 entries, name to process
14dtypes: datetime64[ns](10), float64(68), int64(1), object(35)
15memory usage: 2.9+ MB
16
17# When that happens, you can use,
18df.info(verbose=True, null_counts=True)
19
20# What option sets that default 100 is this
21pd.options.display.max_info_columns
22
23print('Max columns in display setting: ', pd.get_option('max_columns'))

Display all columns of specific datatype

If you have a column naming convention, like in column name itself we can identify datatype, you can use below method to get those columns. Here we are trying to get date columns,

Method 1

1[col for col in df_companies.columns if 'date' in col.lower()]

Method 2

1df_companies['lastraded'].dtypes
2# Output:
3dtype('<M8[ns]')
4
5df_companies.dtypes[df_companies.dtypes=='<M8[ns]']
6
7# Output:
8lastraded datetime64[ns]
9QresultDate datetime64[ns]
10PLResultDate datetime64[ns]
11BSResultDate datetime64[ns]
12RResultDate datetime64[ns]
13CFResultDate datetime64[ns]
14YRResultDate datetime64[ns]
15HiDate datetime64[ns]
16LoDate datetime64[ns]
17TVDate datetime64[ns]
18dtype: object

Method 3

1dtCols = list(df_companies.select_dtypes(include=['datetime']).columns)