
bobby_dreamer

Python Nulls

python, pandas, notes | 6 min read

Null is simply the absence of a value in a variable. Use null when you cannot specify a default value, because any real value would mean something.

Why is it so important in Python? There are two ways to say a variable is null in Python. That is confusing, causes unnecessary issues and breaks things.

  • None
  • np.nan

# None

None is an object in Python. It is the only instance of the NoneType class.

  • None is a keyword, just like True and False, so you cannot use it as a variable name.

  • None is a singleton. That is, the NoneType class has only one instance, None, in the whole program. You can create many variables and assign None to them, and they will all point to that same instance.

  • When checking whether a value is null or not null, use the identity operators (is, is not) rather than the equality operators (==, !=). See Sidetrack 1.

  • None is falsy, meaning it evaluates to False in a boolean context. If you want to know whether a value is truthy or falsy, you can test it like below,

    Truthy and Falsy are in Sidetrack 2
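A minimal sketch of the points above. The variable names are illustrative and not from the original post:

```python
value = None

# Identity check: the recommended way to test for None
is_null = value is None

# None is falsy, so it can also be tested without any operator
if not value:
    message = "value is falsy"
else:
    message = "value is truthy"

# Every name bound to None refers to the same singleton object
a = None
b = None
same_object = a is b

# None is the only instance of the NoneType class
none_type_name = type(None).__name__

print(is_null, message, same_object, none_type_name)
```

Note that `not value` is also true for 0, "" and empty containers, so prefer `value is None` when you specifically mean null.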

# np.nan

NaN means (Not-A-Number).

The IEEE-754 standard defines a NaN as a number with all ones in the exponent, and a non-zero significand. The highest-order bit in the significand specifies whether the NaN is a signaling or quiet one. The remaining bits of the significand form what is referred to as the payload of the NaN.
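The bit layout can be inspected from Python. This is a small sketch that assumes the usual 64-bit double layout (1 sign bit, 11 exponent bits, 52 significand bits); the variable names are illustrative:

```python
import struct

# Reinterpret the 8 bytes of a double-precision NaN as an unsigned integer
bits = struct.unpack(">Q", struct.pack(">d", float("nan")))[0]

exponent = (bits >> 52) & 0x7FF          # the 11 exponent bits
significand = bits & ((1 << 52) - 1)     # the 52 significand bits
top_significand_bit = significand >> 51  # quiet/signaling flag

print(f"{bits:064b}")
print(exponent == 0x7FF, significand != 0, top_significand_bit)
```

On common platforms `float("nan")` is a quiet NaN, so the top significand bit is typically 1, but that part is platform behaviour rather than something this sketch guarantees.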

To know why it is like that, refer to Sidetrack 3.

There are multiple ways to check whether a value is NaN. The recommendation: if you are already using pandas, use pandas; if you are using NumPy, use NumPy; if you are using neither, use math. Why? Imports take space: import math is around 2MB, while the other two are > 10MB.
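The three checks might look like this. It is only a sketch, and the dictionary keys are labels chosen for illustration:

```python
import math

import numpy as np
import pandas as pd

x = float("nan")

checks = {
    "math": math.isnan(x),           # standard library only
    "numpy": bool(np.isnan(x)),      # also works element-wise on arrays
    "pandas": bool(pd.isna(x)),      # also treats None as missing
    "self_inequality": x != x,       # NaN is the only float not equal to itself
}

# pd.isna accepts None, whereas math.isnan(None) raises a TypeError
none_is_missing = pd.isna(None)

print(checks, none_is_missing)
```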

Not all nulls/NaNs are the same

Why? None, np.nan and each newly created float("nan") are different objects, so they have different IDs.
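A quick sketch of the idea, using `id()` and `is`. The names `a` and `b` are illustrative:

```python
import numpy as np

a = float("nan")
b = float("nan")

# Each float("nan") call builds a new object
distinct_nans = a is not b

# None and np.nan are unrelated objects
none_vs_nan = None is not np.nan

# np.nan itself is one module-level object, so it is identical to itself
same_name = np.nan is np.nan

ids = {"None": id(None), "np.nan": id(np.nan), "a": id(a), "b": id(b)}
print(distinct_nans, none_vs_nan, same_name)
print(ids)
```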

Automatic conversions

NumPy/pandas will convert an array/Series to float or object dtype based on the None/np.nan values in it, if you don't handle them.

Here, because of None the array is converted to Object dtype.

np.nan makes it float64. If an integer had been there instead of NaN, the dtype would have been an integer type (int32 or int64, depending on the platform).

sum() behaves differently in the two cases. With object dtype it throws a TypeError, and with float64 it returns nan.
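The conversions and the two sum() outcomes can be reproduced with a sketch like this. The array contents are made up for illustration:

```python
import numpy as np

with_none = np.array([1, None, 3, 4])
with_nan = np.array([1, np.nan, 3, 4])
all_ints = np.array([1, 2, 3, 4])

dtype_none = with_none.dtype   # object
dtype_nan = with_nan.dtype     # float64
dtype_ints = all_ints.dtype    # platform integer type

# Summing the object array fails because int + None is not defined
try:
    with_none.sum()
    none_sum_error = None
except TypeError as exc:
    none_sum_error = exc

# Summing the float array does not fail, but NaN propagates
nan_sum = with_nan.sum()

print(dtype_none, dtype_nan, dtype_ints)
print(repr(none_sum_error), nan_sum)
```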

The below shows that you cannot do any meaningful calculation when you have NaN, because the result is NaN as well.

This is true when you are not using pandas. In the pandas example below, there is no error and no NaN result.

Why? sum() in pandas has an option skipna (bool) and its default value is True. So by default, sum() will exclude all NA/null values when computing the result. When working in pandas, it is always better to check the documentation to see whether such features or options are available.
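A short sketch of NaN propagating through plain arithmetic and NumPy reductions. The values are illustrative:

```python
import numpy as np

# Any arithmetic involving NaN yields NaN
results = [1 + np.nan, 0 * np.nan, np.nan - np.nan, np.nan / 2]
all_nan = all(np.isnan(r) for r in results)

# The same holds for the ordinary NumPy reductions
arr = np.array([1, np.nan, 3, 4])
stats = (arr.sum(), arr.min(), arr.max(), arr.mean())

print(results, stats)
```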
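For instance, with a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, 4])

default_sum = s.sum()              # skipna=True by default, NaN is ignored
strict_sum = s.sum(skipna=False)   # NaN is included, so the result is NaN

print(default_sum, strict_sum)
```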

Handling Nulls(np.nan, None) in Pandas/Numpy

There are only a few ways of handling nulls:

  1. Ignoring nulls
  2. Identifying nulls
  3. Dropping null rows/columns
  4. Replacing nulls with some other values

Ignoring nulls

Below are some ways numpy provides to ignore nans and perform simple calculations.
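A sketch using NumPy's NaN-aware reductions on an illustrative array:

```python
import numpy as np

arr = np.array([1, np.nan, 3, 4])

total = np.nansum(arr)      # sum, ignoring NaN
largest = np.nanmax(arr)    # max, ignoring NaN
smallest = np.nanmin(arr)   # min, ignoring NaN
average = np.nanmean(arr)   # mean of the three non-NaN values

print(total, largest, smallest, average)
```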

Identifying nulls
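One way to identify nulls in pandas is with isnull() and notnull(). The Series below is made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, "hello", None])

null_mask = s.isnull()         # True where the value is NaN or None
not_null_mask = s.notnull()    # the inverse mask
only_values = s[not_null_mask] # boolean indexing keeps the non-null entries

print(null_mask.tolist())
print(only_values.tolist())
```

Both np.nan and None are reported as null here, which is usually what you want when cleaning data.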

Dropping nulls rows/columns

Here you can observe that you cannot drop a single value from a DataFrame. In the below example you can see an entire row getting removed. Options are available to drop an entire column as well. Sometimes this type of result may not be desirable.
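A small sketch with an illustrative DataFrame (the column names A, B and C are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, np.nan],
    "B": [4, np.nan, 6],
    "C": [7, 8, 9],
})

# Every row containing at least one null is removed as a whole
rows_dropped = df.dropna()

# Every column containing at least one null is removed as a whole
cols_dropped = df.dropna(axis="columns")

print(rows_dropped)
print(cols_dropped)
```

Only row 0 and only column C survive, even though each dropped row or column still held valid values.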

df.dropna() has multiple options,

  • df.dropna(axis='columns') : drops all columns that have a null value. Instead of axis='columns', axis=1 can be used.
  • df.dropna(axis='rows') : drops all rows that have a null value. Instead of axis='rows', axis=0 can be used.
  • df.dropna(how='any') : (default) drops rows or columns that have any null value. The default axis is rows.
  • df.dropna(how='all') : drops rows or columns in which all values are null. By default it drops rows (axis=0).

To have more control over which rows/columns are kept, you can specify thresh=2. Having 2 as its value means at least 2 non-null values should be there in the row/column for it to be kept.
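The how and thresh options can be compared on one illustrative DataFrame whose rows hold 3, 2, 1 and 0 non-null values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan],
    "B": [2, 5, np.nan, np.nan],
    "C": [3, 6, 9, np.nan],
})

any_null = df.dropna(how="any")     # keeps rows without any null
all_null = df.dropna(how="all")     # drops only rows that are entirely null
at_least_two = df.dropna(thresh=2)  # keeps rows with >= 2 non-null values

print(any_null.index.tolist())
print(all_null.index.tolist())
print(at_least_two.index.tolist())
```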

Replace nulls with some other values
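As a small sketch of filling nulls across a whole DataFrame (the frame and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, 5, np.nan]})

# Replace every null in the frame with a constant
filled_zero = df.fillna(0)

# Forward fill: carry the last valid value down each column
filled_ffill = df.ffill()

print(filled_zero)
print(filled_ffill)
```

Forward fill leaves the first value of B as NaN, because there is no earlier value to carry forward.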

Above we have seen filling nulls for the full DataFrame. Nulls can be filled column-wise as well. Below are some examples.
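A column-wise sketch, again on an illustrative DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, np.nan, 3], "B": [np.nan, 5, np.nan]})

# A dict maps each column to its own fill value
per_column = df.fillna({"A": df["A"].mean(), "B": -1})

# Fill a single column and leave the others untouched
only_a = df.copy()
only_a["A"] = only_a["A"].fillna(0)

print(per_column)
print(only_a)
```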

Counting number of nulls in row/columns
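Since isnull() returns booleans, summing the mask counts the nulls. A sketch with an illustrative DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 3],
    "B": [np.nan, np.nan, 6],
    "C": [7, 8, 9],
})

per_column = df.isnull().sum()         # nulls in each column
per_row = df.isnull().sum(axis=1)      # nulls in each row
total = int(df.isnull().sum().sum())   # nulls in the whole frame

print(per_column.to_dict(), per_row.tolist(), total)
```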

# Sidetrack 1 : Identity Operators Vs. Equality Operators

  1. Identity operator : Checks whether two variables refer to the same object (the same id()), not whether their values are equal.
  • Two identity operators available are is and is not

  2. Equality operator : Checks whether the two values are equal (how equality is defined varies from object to object).
  • Two equality operators available are == and !=

Using the equality operators to compare with None is not recommended, because PEP 8 says so:

"Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators."

Check the below example, copied from realpython.com

The equality operators can be fooled when you’re comparing user-defined objects that override them. Here, the equality operator == returns the wrong answer. The identity operator is, on the other hand, can’t be fooled because you can’t override it.
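A sketch along the lines of that example. The class name is illustrative and may differ from the one used on realpython.com:

```python
class BrokenComparison:
    """A class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


b = BrokenComparison()

equal_to_none = b == None   # True: the overridden __eq__ answers
identical_to_none = b is None  # False: identity cannot be overridden

print(equal_to_none, identical_to_none)
```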

Comparing None with == also works, but it is not recommended, so be careful not to use it. None by definition is the absence of a value (null). What happens here is that the id() of None is compared: both sides refer to the exact same memory location, so the None comparison becomes True. Python tests an object's identity first, meaning it checks whether the objects are found at the same memory address.

Expanding on the above a bit more. See [ Ref1 ]

To check if the item is in the list, Python tests for object identity first, and then tests for equality only if the objects are different.
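This identity-first rule explains a NaN surprise with the in operator. A minimal sketch with illustrative names:

```python
import numpy as np

# Same object on both sides: the identity test succeeds before == is tried
same_object_found = np.nan in [np.nan]

# Two distinct NaN objects: identity fails, and NaN == NaN is False
a = float("nan")
b = float("nan")
different_object_found = a in [b]

# Plain equality never succeeds for NaN, even with the same object
equal = np.nan == np.nan

print(same_object_found, different_object_found, equal)
```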

More Comparisons

Comparing NaN with itself (np.nan == np.nan) is always False, and we should learn to live with it. Stephen Canon makes good points on this (see Reference 2).

Comparison 2
Comparison 3

# Sidetrack 2 : Truthy and Falsy values

When you are comparing values, there can be only two results, True or False, which are booleans. As of now, I don't think there is a programming language supporting Not-a-Boolean (NaB). Expressions usually evaluate to these values.

We can test expressions like below without operators,
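A sketch of two such tests. The values are illustrative: the first example uses a comparison operator, the second uses a bare value:

```python
# First example: an expression with a comparison operator
if 5 > 3:
    first = "5 > 3 is True"
else:
    first = "5 > 3 is False"

# Second example: a bare value, no operator at all
items = []
if items:
    second = "list has items"
else:
    second = "list is empty"

print(first)
print(second)
```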

What happened in the second example is because of the concept of truthy and falsy. Here,

  • any value that evaluates to False in a boolean context is falsy
  • any value that evaluates to True in a boolean context is truthy

Below are falsy values

  • empty lists : []
  • empty tuples : ()
  • empty dictionaries : {}
  • empty sets : set()
  • blank string : "", ''
  • Number 0 : 0, 0.0
  • Boolean : False
  • Null : None

Below are truthy values

  • non-empty data structures (lists, tuples, dictionaries, sets, strings)
  • non-zero numeric values
  • Boolean ( True )

Simple example in using Truthy
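One possible example. The function name describe is hypothetical and chosen for illustration:

```python
def describe(values):
    """Return a short summary, relying on truthiness instead of len() == 0."""
    if values:  # truthy: the container has at least one element
        return f"{len(values)} item(s)"
    return "nothing to process"


# Every falsy value listed above evaluates to False
falsy_examples = [[], (), {}, set(), "", 0, 0.0, False, None]
all_falsy = not any(bool(v) for v in falsy_examples)

print(describe([]))
print(describe([1, 2]))
print(all_falsy)
```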

# Sidetrack 3 : Possible reason why np.nan==np.nan is False

This is from Reflexivity, and other pillars of civilization. It is a good read.

Equality is reflexive (every value is equal to itself, at any longitude and temperature, no excuses and no exceptions); and the purpose of assignment is to make the value of the target equal to the value of the source.

754 enters the picture

Now assume that the value of x is a NaN. If you use a programming language that supports IEEE 754 (as they all do, I think, today) the test in

if x = x then …

is supposed to yield False. Yes, that is specified in the standard: NaN is never equal to NaN (even with the same payload); nor is it equal to anything else; the result of an equality comparison involving NaN will always be False.

I am by no means a numerics expert; I know that IEEE 754 was a tremendous advance, and that it was designed by some of the best minds in the field, headed by Velvel Kahan who received a Turing Award in part for that success.

Why is the result False? The conclusion is not that the result should be False. The rational conclusion is that True and False are both unsatisfactory solutions. The reason is very simple: in a proper theory (I will sketch it below) the result of such a comparison should be some special undefined value; the same way that IEEE 754 extends the set of floating-point numbers with NaN, a satisfactory solution would extend the set of booleans with some NaB (Not a Boolean). But there is no NaB, probably because no one (understandably) wanted to bother, and also because being able to represent a value of type BOOLEAN on a single bit is, if not itself a pillar of civilization, one of the secrets of a happy life.

If both True and False are unsatisfactory solutions, we should use the one that is the “least” bad, according to some convincing criterion. That is not the attitude that 754 takes; it seems to consider (as illustrated by the justification cited above) that False is somehow less committing than True. But it is not! Stating that something is false is just as much of a commitment as stating that it is true. False is no closer to NaB than True is. A better criterion is: which of the two possibilities is going to be least damaging to time-honored assumptions embedded in mathematics? One of these assumptions is the reflexivity of equality: come rain or shine, x is equal to itself. Its counterpart for programming is that after an assignment the target will be equal to the original value of the source. This applies to numbers, and it applies to a NaN as well.

Note that this argument does not address equality between different NaNs. The standard as it is states that a specific NaN, with a specific payload, is not equal to itself.


And this is where I stopped and decided not to go further into this subject.

References

  1. S.O : in operator, float(“NaN”) and np.nan
  2. S.O : What is the rationale for all comparisons returning false for IEEE754 NaN values?
  3. Reflexivity, and other pillars of civilization
  4. S.O : More NaN Wars : Why is NaN not equal to NaN?
  5. Jake VanderPlas : Handling Missing Data