bobby_dreamer

Python Nulls

python, pandas, notes | 6 min read

Null is simply the absence of a value in a variable. It is useful when you cannot specify a default, because every possible value would carry a meaning of its own.

>>> def has_no_return():  #<- A function that doesn't return anything
...     pass
...
>>> has_no_return()  #<- Called at the prompt, it shows nothing, as expected
>>> print(has_no_return())  #<- print() has to print something
None  #<- The function returned nothing, so print() shows None
>>> # None is the hidden return value

Why is this so important in Python? There are two ways to say a value is null in Python. That is confusing, and it causes unnecessary issues and breaks things.

  • None
  • np.nan

# None

None is an object in Python. Its class is NoneType, not str.

>>> type(None)
<class 'NoneType'>

  • None is a keyword, just like True and False, so you cannot use it as a variable name.

  • None is a singleton. That is, there is only ever one instance of the NoneType class in a program. You can create many variables and assign None to them, and they will all point to the same instance of None.

    >>> id(None)
    1560644480
    >>> a = None
    >>> b = None
    >>> id(a)
    1560644480
    >>> id(b)
    1560644480
  • When checking whether a value is null or not, use the identity operators (is, is not) rather than the equality operators (==, !=). See Sidetrack 1.

  • None is falsy, meaning it evaluates to False in a boolean context. Other values, such as an empty string, are falsy too, as the test below shows.

    >>> a = 'hi'
    >>> if a:  #<- 'a' has the value 'hi'
    ...     print(a)
    ... else:
    ...     print('Other than a')
    ...
    hi  #<- The condition tested True and printed 'hi'

    >>> a = ''
    >>> if a:  #<- 'a' is an empty string
    ...     print(a)
    ... else:
    ...     print('Other than a')
    ...
    Other than a  #<- The condition tested False. Shouldn't it have printed <blank>?
                  # What happened? The empty string is falsy

    Truthy and falsy values are covered in Sidetrack 2.
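Because None is only one of several falsy values, a truthiness test cannot tell None apart from an empty string or zero. The sketch below illustrates the difference. The helper name describe is hypothetical and exists only for this illustration.

```python
def describe(value):
    """Classify a value as None, falsy, or truthy (illustrative helper)."""
    if value is None:          # identity check: matches only None itself
        return "None"
    if not value:              # truthiness check: matches '', 0, [], ...
        return "falsy but not None"
    return "truthy"

print(describe(None))   # None
print(describe(''))     # falsy but not None
print(describe(0))      # falsy but not None
print(describe('hi'))   # truthy
```

If the code needs to react only to a missing value, the `is None` branch is the one to rely on.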

# np.nan

NaN means Not-a-Number.

The IEEE-754 standard defines a NaN as a number with all ones in the exponent, and a non-zero significand. The highest-order bit in the significand specifies whether the NaN is a signaling or quiet one. The remaining bits of the significand form what is referred to as the payload of the NaN.
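The bit layout described above can be inspected from Python with the standard struct module. This is a minimal sketch, assuming Python floats are IEEE-754 64-bit doubles (true on mainstream platforms). The helper float_bits is a hypothetical name, not a library function.

```python
import struct

def float_bits(x):
    """Return (sign, exponent, significand) bit fields of a 64-bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # 11 exponent bits
    significand = bits & ((1 << 52) - 1)   # 52 significand bits
    return sign, exponent, significand

sign, exponent, significand = float_bits(float("nan"))
print(exponent == 0x7FF)    # True: all exponent bits are ones
print(significand != 0)     # True: a non-zero significand marks a NaN

# Infinity shares the all-ones exponent but has a zero significand
_, inf_exponent, inf_significand = float_bits(float("inf"))
print(inf_exponent == 0x7FF, inf_significand == 0)   # True True
```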

>>> import numpy as np
>>> np.nan == np.nan  #<- It is how it is.
False

To know why it behaves like that, refer to Sidetrack 3.

There are multiple ways to check whether a value is NaN. The recommendation: if you are already using pandas, use pandas; if you are using NumPy, use NumPy; if you are using neither, use math. Why? Imports take up memory: import math is around 2MB, while the other two are more than 10MB.

import pandas as pd
import numpy as np
import math

# For a single variable, all three libraries return a single boolean
x1 = float("nan")

print("It's pd.isna : {}".format(pd.isna(x1)))
print("It's pd.isnull : {}".format(pd.isnull(x1)))
print("It's np.isnan : {}".format(np.isnan(x1)))
print("It's math.isnan : {}".format(math.isnan(x1)))

# Output
It's pd.isna : True
It's pd.isnull : True
It's np.isnan : True
It's math.isnan : True
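The checks above agree on a float NaN, but they differ on other inputs: math.isnan() raises a TypeError for values that are not real numbers, such as None, while pd.isna() treats None as missing. The sketch below shows a standard-library-only check that never raises. The name safe_isnan is hypothetical, and the helper is one possible approach rather than a library API.

```python
import math

def safe_isnan(value):
    """Return True only for a float NaN; never raise on non-numeric input."""
    # NaN is the only float that is not equal to itself
    return isinstance(value, float) and value != value

print(safe_isnan(float("nan")))   # True
print(safe_isnan(1.5))            # False
print(safe_isnan(None))           # False
print(safe_isnan("nan"))          # False

# math.isnan() rejects values that are not real numbers
try:
    math.isnan(None)
except TypeError:
    print("math.isnan(None) raised TypeError")
```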

Not all nulls/NaNs are the same object

print(math.nan is math.nan)      #<- True
print(math.nan is np.nan)        #<- False
print(math.nan is float('nan'))  #<- False

Why? They are different objects, each with its own ID.

>>> id(math.nan), id(np.nan), id(float('nan'))
(32474464, 32473712, 225025248)

Automatic conversions

NumPy/pandas silently convert an array/Series to float or object dtype when it contains None/np.nan values, unless you handle them yourself.

Here, because of None the array is converted to Object dtype.

>>> vals1 = np.array([1, None, 3, 4])
>>> vals1
array([1, None, 3, 4], dtype=object)
>>> vals1.sum()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Sushanth\AppData\Local\Programs\Python\Python35-32\lib\site-packages\numpy\core\_methods.py", line 38, in _sum
    return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
>>>

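np.nan makes it float64. If a number had been there instead of NaN, the dtype would have been the platform's default integer type (int32 on the 32-bit Windows build used here; int64 on most 64-bit platforms).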

>>> vals1 = np.array([1, np.nan, 3, 4])
>>> vals1
array([ 1., nan, 3., 4.])
>>> type(vals1), vals1.dtype
(<class 'numpy.ndarray'>, dtype('float64'))
>>> vals1.sum()
nan

sum() behaved differently in the two cases: on the object dtype array it raised a TypeError, and on the float64 array it returned nan.
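The dtype conversions can be checked directly. The sketch below also shows two pandas behaviors that are not covered above: pandas turns None into NaN in numeric data, and the nullable "Int64" extension dtype keeps integers while marking the gap as <NA>. Treat it as an illustration under the assumption of a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# None in a NumPy array forces the generic object dtype
print(np.array([1, None, 3]).dtype)      # object

# np.nan is a float, so the whole array is upcast to float64
print(np.array([1, np.nan, 3]).dtype)    # float64

# pandas converts None to NaN in numeric data and upcasts to float64
s = pd.Series([1, None, 3])
print(s.dtype)                           # float64

# The nullable "Int64" extension dtype keeps integers and stores <NA>
s_int = pd.Series([1, None, 3], dtype="Int64")
print(s_int.dtype)                       # Int64
print(s_int.sum())                       # 4
```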

The example below shows that any arithmetic involving NaN results in NaN.

>>> 1 + np.nan
nan

This holds when you are not using pandas. In the example below, pandas returns a number instead of NaN, and there is no error.

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4]})
>>> df['A'].sum()
10
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, np.nan]})
>>> df['A'].sum()
6.0

Why? sum() in pandas has an option, skipna (a bool), whose default value is True. So by default, sum() excludes all NA/null values when computing the result. When working in pandas, it is always worth checking the documentation for options like this.
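A short sketch of how the skipna option changes the result. It also shows min_count, another sum() parameter, which controls what an all-NaN Series sums to.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, np.nan])

print(s.sum())               # 6.0, NaN is skipped by default
print(s.sum(skipna=False))   # nan, NaN propagates as in NumPy

# An all-NaN Series sums to 0.0 unless min_count demands valid values
all_nan = pd.Series([np.nan, np.nan])
print(all_nan.sum())               # 0.0
print(all_nan.sum(min_count=1))    # nan
```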

Handling Nulls(np.nan, None) in Pandas/Numpy

There are only a few ways of handling nulls:

  1. Ignoring nulls
  2. Identifying nulls
  3. Dropping nulls rows/columns
  4. Replace nulls with some other values

Ignoring nulls

Below are some ways numpy provides to ignore nans and perform simple calculations.

>>> vals1 = np.array([1, np.nan, 3, 4])
>>> np.nansum(vals1), np.nanmin(vals1), np.nanmax(vals1)
(8.0, 1.0, 4.0)
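When no nan-aware function exists for a calculation, a boolean mask built with np.isnan() gives the same effect. This is a minimal sketch of that alternative, not the only way to do it.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# Boolean mask that is True where the value is not NaN
valid = ~np.isnan(vals)

print(vals[valid])           # [1. 3. 4.]
print(vals[valid].mean())    # mean of the non-NaN values
print(np.nanmean(vals))      # same result via the nan-aware function
```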

Identifying nulls

>>> data = pd.Series([1, np.nan, 3, 4])
>>> data.isnull()
0    False
1     True
2    False
3    False
dtype: bool

>>> data[data.isnull()]  #<- data.isna() also gives the same result
1   NaN
dtype: float64

>>> data[data.notnull()]
0    1.0
2    3.0
3    4.0
dtype: float64
>>>
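The same idea extends to a DataFrame, where isna() returns a boolean mask and any() reduces it per row or per column. The column names below are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan], "c": [7, 8, 9]})

# True for each row that holds at least one null
rows_with_null = df.isna().any(axis=1)
print(rows_with_null.tolist())                # [False, True, True]

# Index labels of the rows that contain a null
print(df[rows_with_null].index.tolist())      # [1, 2]

# Names of the columns that contain a null
print(df.columns[df.isna().any()].tolist())   # ['a', 'b']
```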

Dropping nulls rows/columns

>>> data = pd.Series([1, np.nan, 3, 4])
>>> data.dropna()
0    1.0  #<- Here the data in index[1] is dropped
2    3.0
3    4.0
dtype: float64
>>>

Note that you cannot drop a single value from a DataFrame. In the example below, an entire row is removed. Options are available to drop entire columns as well. Sometimes this kind of result is not desirable.

>>> df = pd.DataFrame([[1, np.nan, 5],
...                    [2, 3, 6],
...                    [np.nan, 4, 7]])

>>> df
     0    1  2
0  1.0  NaN  5
1  2.0  3.0  6
2  NaN  4.0  7

>>> df.dropna()
     0    1  2
1  2.0  3.0  6

df.dropna() has multiple options:

  • df.dropna(axis='columns') : drops all columns that have a null value. axis=1 can be used instead of axis='columns'.
  • df.dropna(axis='rows') : drops all rows that have a null value. axis=0 can be used instead of axis='rows'.
  • df.dropna(how='any') : (default) drops a row or column if it has any null. The default axis is rows.
  • df.dropna(how='all') : drops rows or columns in which all values are null. By default it drops rows (axis=0).

>>> df = pd.DataFrame([[1, np.nan, 5],
...                    [2, 3, 6],
...                    [np.nan, 4, 7]])

>>> df.dropna(axis='columns')
   2
0  5
1  6
2  7

>>> df.dropna(axis='rows')
     0    1  2
1  2.0  3.0  6

>>> df.dropna(how='any')
     0    1  2
1  2.0  3.0  6

>>> df = pd.DataFrame([[np.nan, np.nan, np.nan],
...                    [np.nan, 3, 6],
...                    [np.nan, 4, 7]])
>>>
>>> df.dropna(how='all')  #<- It dropped the first row
    0    1    2
1 NaN  3.0  6.0
2 NaN  4.0  7.0
>>>
>>> df.dropna(how='all', axis=1)  #<- It dropped the first column
     1    2
0  NaN  NaN
1  3.0  6.0
2  4.0  7.0

To have more control over what is kept, you can specify thresh. For example, thresh=2 means a row/column must have at least 2 non-null values to be kept.

>>> df = pd.DataFrame([[np.nan, np.nan, np.nan],
...                    [np.nan, 3, 6],
...                    [np.nan, 4, 7]])

>>> df.dropna(thresh=2)  #<- Default is axis='rows' or axis=0, so the first row is dropped
    0    1    2
1 NaN  3.0  6.0
2 NaN  4.0  7.0
>>>
>>> df.dropna(axis=1, thresh=2)
     1    2
0  NaN  NaN
1  3.0  6.0
2  4.0  7.0
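dropna() also accepts a subset parameter, which limits the null check to the listed columns. The sketch below uses made-up column names to show how it avoids dropping rows because of nulls in columns you do not care about.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "score": [9.5, np.nan, 7.0],
                   "note": [np.nan, np.nan, "ok"]})

# Plain dropna() removes every row that has a null in any column
print(df.dropna()["id"].tolist())    # [3]

# Drop a row only when 'score' is null; nulls in 'note' are tolerated
kept = df.dropna(subset=["score"])
print(kept["id"].tolist())           # [1, 3]
```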

Replace nulls with some other values

# This option fills all nulls with a predefined value.
>>> data = pd.Series([1, np.nan, 3, 4])
>>> data.fillna(0)
0    1.0
1    0.0
2    3.0
3    4.0
dtype: float64

>>> data.fillna(method='bfill')  #<- bfill is backward fill. Data in index[2] is filled into index[1]
0    1.0
1    3.0
2    3.0
3    4.0
dtype: float64

>>> df = pd.DataFrame([[np.nan, 1, np.nan],
...                    [2, np.nan, 3],
...                    [np.nan, 4, 7]])
>>>
>>> df.fillna(method='ffill', axis=1)  # Forward filling at column level (Left -> Right)
     0    1    2  #<- df[column][row]
0  NaN  1.0  1.0  #<- Data in df[1][0] is filled into df[2][0]
1  2.0  2.0  3.0  #<- Data in df[0][1] is filled into df[1][1]
2  NaN  4.0  7.0  #<- There is nothing to forward fill, as the NaN is in df[0][2]

>>> df.fillna(method='bfill', axis=0)  # Backward filling at row level (Bottom -> Top)
     0    1    2
0  2.0  1.0  3.0  #<- Data in df[0][1] & df[2][1] is filled into df[0][0] & df[2][0]
1  2.0  4.0  3.0  #<- Data in df[1][2] is filled into df[1][1]
2  NaN  4.0  7.0  #<- The null in df[0][2] is left as is.
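Recent pandas releases deprecate the method= argument of fillna() in favor of the dedicated ffill() and bfill() methods. The sketch below repeats the two DataFrame fills with those methods. Which spelling works without warnings depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 1, np.nan],
                   [2, np.nan, 3],
                   [np.nan, 4, 7]])

# Forward fill, left to right within each row
filled_right = df.ffill(axis=1)
print(filled_right)

# Backward fill, bottom to top within each column
filled_up = df.bfill(axis=0)
print(filled_up)
```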

So far we have filled nulls across the whole DataFrame. Nulls can also be filled one column at a time. Below are some examples.

# Pandas, using fillna()
df['col1'] = df['col1'].fillna(0)

# Pandas, using replace() with NumPy's np.nan
df['col1'] = df['col1'].replace(np.nan, 0)
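fillna() also accepts a dict that maps column names to replacement values, so several columns can get different fills in one call. The column names and fill values below are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, np.nan, 3.0],
                   "col2": [np.nan, "b", "c"]})

# Each column gets its own replacement: the column mean for col1,
# a placeholder string for col2
filled = df.fillna({"col1": df["col1"].mean(), "col2": "unknown"})
print(filled["col1"].tolist())   # [1.0, 2.0, 3.0]
print(filled["col2"].tolist())   # ['unknown', 'b', 'c']
```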

Counting number of nulls in row/columns

>>> df = pd.DataFrame([[1, 1, np.nan],
...                    [np.nan, np.nan, np.nan],
...                    [np.nan, 4, 7]])
>>> df
     0    1    2
0  1.0  1.0  NaN
1  NaN  NaN  NaN
2  NaN  4.0  7.0
>>> df.isnull().sum(axis=1)  #<- Counts the nulls across columns, per row
0    1
1    3
2    1
dtype: int64

>>> df.isnull().sum(axis=0)  #<- Counts the nulls across rows, per column
0    2
1    1
2    2
dtype: int64
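A related trick: the mean of a boolean mask is the fraction of True values, so the same mask gives the share of nulls per column. A small sketch on the DataFrame used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 4, 7]])

# Fraction of nulls in each column
null_share = df.isnull().mean()
print(null_share.round(2).tolist())   # [0.67, 0.33, 0.67]

# Total number of nulls in the whole DataFrame
total_nulls = int(df.isnull().sum().sum())
print(total_nulls)                    # 5
```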

# Sidetrack 1 : Identity Operators Vs. Equality Operators

  1. Identity operator : Checks whether two names refer to the same object. Combined with type(), it can also be used to check the data type of a variable, as in the example below.
  • The two identity operators are is and is not

    >>> a = 'hi'
    >>> b = 'hello'
    >>> id(a)
    21506528  #<- Variable a's ID
    >>> id(b)
    21100928  #<- Variable b's ID

    >>> type(a)
    <class 'str'>  #<- The data type of variable a is string

    >>> id(str)
    1560662608  #<- ID of str

    >>> id(type(a))
    1560662608  #<- ID of the data type of variable a. It is the ID of the str class
    >>> id(type(b))
    1560662608  #<- ID of the data type of variable b. It is the ID of the str class

    >>> type(a) is str
    True  #<- So obviously, it's going to be True.

    >>> b = 1  #<- Assigning an integer value to variable b
    >>> id(b)
    1560762640  #<- Now variable b has a different ID
    >>> type(b)
    <class 'int'>
    >>> type(b) is not str
    True  #<- Now we definitely know that variable b is not a string
  2. Equality operator : Checks whether two values are equal (what counts as equal is defined by each object's class).

    • The two equality operators are == and !=

    >>> a = 'hi'
    >>> print(a)
    hi
    >>> a is None
    False
    >>> a == None
    False
    >>> a != None  #<- At this point you might think the equality operators are good enough to check for None.
    True           # It is not recommended

It is not recommended because PEP 8 says so:

"Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators."

Consider the example below, copied from realpython.com.

>>> class BrokenComparison:
...     def __eq__(self, other):
...         return True
...
>>> b = BrokenComparison()
>>> b == None
True

The equality operators can be fooled when you’re comparing user-defined objects that override them. Here, the equality operator == returns the wrong answer. The identity operator is, on the other hand, can’t be fooled because you can’t override it.
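Running the same class against both operators makes the difference concrete. The helper is_missing is a hypothetical name used only for this sketch.

```python
class BrokenComparison:
    """Claims to be equal to everything, including None."""
    def __eq__(self, other):
        return True

def is_missing(value):
    """Return True only when value really is None (illustrative helper)."""
    return value is None

b = BrokenComparison()
print(b == None)          # True, the overridden __eq__ lies
print(b is None)          # False, identity cannot be overridden
print(is_missing(b))      # False
print(is_missing(None))   # True
```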

Comparing None with == also works, but it is not recommended, so be careful not to use it. None by definition is the absence of a value (null). Because None is a singleton, both operands are the same object at the same memory address, and since NoneType does not define its own equality, the comparison falls back to identity and returns True.

>>> None == None
True
>>> id(None)  #<- Both None operands point to the same id()
1560644480

Expanding on the above a bit more (see reference 1):

>>> nans = [None for i in range(2)]  #<- Adding None to a list
>>> list(map(id, nans))  #<- Printing the id()s of None
[1539541888, 1539541888]  #<- As expected, they have the same ID

>>> nans = [np.nan for i in range(2)]  #<- Adding np.nan to a list
>>> list(map(id, nans))  #<- Printing the id()s of np.nan
[32473712, 32473712]  #<- As expected, they have the same ID

>>> nans = [float("NaN") for i in range(2)]  #<- Adding float("NaN") to a list. The thing to remember
>>> list(map(id, nans))  #<- is that each call to float("NaN") creates a new object.
[26864592, 201935840]  #<- Different IDs, meaning they are different objects.

To check if the item is in the list, Python tests for object identity first, and then tests for equality only if the objects are different.

>>> nans = [None, np.nan, float("NaN")]
>>> None in nans  #<- The identity test returns True, so Python finds the item in the list.
True
>>> np.nan in nans  #<- The identity test returns True, so Python finds the item in the list.
True
>>> float("NaN") in nans  #<- False, because these are two different NaN objects, as seen in the map example above
False
>>> fnan = float("NaN")  #<- This is obviously True because you are referring to the same item.
>>> fnan in [fnan]
True
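Since the in operator depends on object identity for NaN, it is not a reliable way to ask whether a list contains a NaN. One possible alternative is an explicit scan with math.isnan(). The helper contains_nan below is a hypothetical name for this sketch.

```python
import math

def contains_nan(items):
    """Return True if any float in items is NaN (illustrative helper)."""
    return any(isinstance(x, float) and math.isnan(x) for x in items)

values = [None, 1.0, float("nan")]
print(float("nan") in values)       # False: a new NaN object, and NaN != NaN
print(contains_nan(values))         # True
print(contains_nan([None, 1.0]))    # False
```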

More Comparisons

This comparison is always False, and we should learn to live with it. Stephen Canon makes good points on this.

>>> np.nan == np.nan
False

Comparison 2

>>> a = np.array([2, [3], 4])  #<- Output from an older NumPy; recent versions need dtype=object for ragged input like this
>>> a[1] == [3]
True
>>> a == [3]
array([False, True, False])

>>> b = np.array([None, [np.nan]])
>>> b[1] == [np.nan]  #<- Two lists holding the same NaN object are compared, and the identity check on the element succeeds.
True
>>> b == [np.nan]  #<- Here the comparison is False. NumPy compares the values, and they differ.
array([False, False])

Comparison 3

>>> lst = [1,2,3]
>>> id(lst)
194154944
>>> lst == lst[:]
True  # <- This is True since the lists are "equivalent"
>>> lst is lst[:]
False  # <- This is False since they're actually different objects
>>> id(lst[:])
194156064
>>>

# Sidetrack 2 : Truthy and Falsy values

When you compare values, there can be only two results, True or False, which are booleans. As far as I know, no programming language supports a Not-a-Boolean (NaB). Expressions usually evaluate to one of these two values.

We can also test a bare expression, without any comparison operator, like below:

a = 10
if a:
    print(a)
else:
    print('i hope variable has a value initialized')

# Output
10

a = 0
if a:
    print(a)
else:
    print('i hope variable has a value initialized')

# Output
i hope variable has a value initialized

What happened in the second example comes from the concept of truthy and falsy values. Here,

  • any value that evaluates to False in a boolean context is falsy
  • any value that evaluates to True in a boolean context is truthy

Below are falsy values

  • empty lists : []
  • empty tuples : ()
  • empty dictionaries : {}
  • empty sets : set()
  • blank string : "", ''
  • Number 0 : 0, 0.0
  • Boolean : False
  • None

Below are truthy values

  • non-empty data structures (lists, tuples, dictionaries, sets, strings)
  • non-zero numeric values
  • Boolean ( True )
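The lists above can be verified with bool(). The sketch also highlights one case that is easy to get wrong: NaN is a non-zero float, so it is truthy even though it usually stands for a missing value.

```python
falsy_values = [None, False, 0, 0.0, "", [], (), {}, set()]
truthy_values = [True, 1, -1, "0", " ", [0], (None,), {"k": None}]

print(any(bool(v) for v in falsy_values))    # False
print(all(bool(v) for v in truthy_values))   # True

# NaN is truthy, so "if value:" does not detect a missing number
print(bool(float("nan")))                    # True
```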

A simple example of using truthiness

name = "sushanth"
if len(name) > 0:  # explicit length check
    print('Hello {}'.format(name))
else:
    print('Wassap')

name = "sushanth"
if name:  # same test, relying on truthiness
    print('Hello {}'.format(name))
else:
    print('Wassap')

# Sidetrack 3 : Possible reason why np.nan==np.nan is False

The excerpts below are from Reflexivity, and other pillars of civilization (reference 3). It is a good read.

Equality is reflexive (every value is equal to itself, at any longitude and temperature, no excuses and no exceptions); and the purpose of assignment is to make the value of the target equal to the value of the source.

754 enters the picture

Now assume that the value of x is a NaN. If you use a programming language that supports IEEE 754 (as they all do, I think, today) the test in

if x = x then …

is supposed to yield False. Yes, that is specified in the standard: NaN is never equal to NaN (even with the same payload); nor is it equal to anything else; the result of an equality comparison involving NaN will always be False.

I am by no means a numerics expert; I know that IEEE 754 was a tremendous advance, and that it was designed by some of the best minds in the field, headed by Velvel Kahan who received a Turing Award in part for that success.

Why is the result False? The conclusion is not that the result should be False. The rational conclusion is that True and False are both unsatisfactory solutions. The reason is very simple: in a proper theory (I will sketch it below) the result of such a comparison should be some special undefined value; the same way that IEEE 754 extends the set of floating-point numbers with NaN, a satisfactory solution would extend the set of booleans with some NaB (Not a Boolean). But there is no NaB, probably because no one (understandably) wanted to bother, and also because being able to represent a value of type BOOLEAN on a single bit is, if not itself a pillar of civilization, one of the secrets of a happy life.

If both True and False are unsatisfactory solutions, we should use the one that is the “least” bad, according to some convincing criterion . That is not the attitude that 754 takes; it seems to consider (as illustrated by the justification cited above) that False is somehow less committing than True. But it is not! Stating that something is false is just as much of a commitment as stating that it is true. False is no closer to NaB than True is. A better criterion is: which of the two possibilities is going to be least damaging to time-honored assumptions embedded in mathematics? One of these assumptions is the reflexivity of equality: come rain or shine, x is equal to itself. Its counterpart for programming is that after an assignment the target will be equal to the original value of the source. This applies to numbers, and it applies to a NaN as well.

Note that this argument does not address equality between different NaNs. The standard as it is states that a specific NaN, with a specific payload, is not equal to itself.
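The behavior discussed in the excerpts is easy to reproduce in Python. The sketch below shows that a NaN is not equal to itself even right after an assignment, while Python containers stay reflexive because they test identity before equality.

```python
x = float("nan")
y = x                      # assignment: y refers to the very same object

print(y is x)              # True
print(y == x)              # False, equality is not reflexive for NaN
print(x != x)              # True, the only comparison on NaN that holds
print(x < 1.0, x > 1.0)    # False False

# Containers check identity before equality, so they behave reflexively
print([x] == [x])          # True
```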


And this is where I stopped and decided not to go any further into this subject.

References

  1. S.O : in operator, float(“NaN”) and np.nan
  2. S.O : What is the rationale for all comparisons returning false for IEEE754 NaN values?
  3. Reflexivity, and other pillars of civilization
  4. S.O : More NaN Wars : Why is NaN not equal to NaN?
  5. Jake VanderPlas : Handling Missing Data