
bobby_dreamer

# BSE Weekly trend analysis using Pandas & Numpy

python, jupyter, numpy · 4 min read

It's been a month since I last worked on Python; the last time was during the Udacity Data Engineering Nanodegree, and over the past month I was fully occupied with Rexx and Db2 REST.

In this exercise, I am exploring what one can do with BSE daily data; for this analysis I will be using one year's worth of data. This is not technical analysis, just a little exercise to improve some Python skills (upskilling).

Data available here

After looking at the data and giving it a lot of thought, I think I can derive the metrics below from it; this alone took a couple of days to work out :D

| Field | Description |
| --- | --- |
| closeH | High close in the week |
| closeL | Low close in the week |
| volHigh | Highest volume in the week |
| volAvg | Average volume in the week |
| daysTraded | Number of days traded in the week |
| HSDL | Highest single-day loss |
| HSDG | Highest single-day gain |
| HSDLp | Highest single-day loss percent |
| HSDGp | Highest single-day gain percent |
| first | First close of the week |
| last | Last close of the week |
| wChng | Week change |
| wChngp | Week change percent |
| lastTrdDoW | Last traded day of the week |
| TI | Times increased |
| volAvgWOhv | Average volume excluding the high-volume day |
| HVdAV | High volume / average volume (excluding the high-volume day) |
| CPveoHVD | Close positive on the high-volume day |
| lastDVotWk | Last day volume of the week |
| lastDVdAV | Last day volume / average volume |

I was able to calculate all of the above fields into a dataframe. The step-by-step approach is available in the notebook on GitHub (a rough sketch of the general idea is below).
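To make the field list concrete, here is a minimal sketch of how a few of these weekly metrics could be computed with pandas. The column names ts, close and vol, and the toy data, are assumptions for one company's daily data, not necessarily the notebook's exact names:

```python
import pandas as pd

# Toy daily data for one company; 'ts', 'close', 'vol' are assumed column names
df = pd.DataFrame({
    'ts':    pd.to_datetime(['2020-01-06', '2020-01-07', '2020-01-08', '2020-01-13', '2020-01-14']),
    'close': [100.0, 102.5, 101.0, 99.0, 103.0],
    'vol':   [15000, 22000, 18000, 12000, 30000],
})

df = df.sort_values('ts')
df['yearWeek'] = df['ts'].dt.strftime('%Y-%W')   # week bucket

weekly = df.groupby('yearWeek').agg(
    closeH=('close', 'max'),        # high close in the week
    closeL=('close', 'min'),        # low close in the week
    volHigh=('vol', 'max'),         # highest volume in the week
    volAvg=('vol', 'mean'),         # average volume in the week
    daysTraded=('close', 'size'),   # number of days traded in the week
    first=('close', 'first'),       # first close of the week
    last=('close', 'last'),         # last close of the week
)
weekly['wChng']  = weekly['last'] - weekly['first']
weekly['wChngp'] = weekly['wChng'] / weekly['first'] * 100
```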

To cut things short, the first run took 1h 26min and the second run took 1h 41min 41s with pandas to complete the analysis for one year of data. I didn't expect it to take that long to execute, and I didn't know where to start either, so I posted my code on Stack Overflow asking for suggestions, and one user asked:

> Did you profile to see exactly which line(s) were the bottleneck(s)? – Mad Physicist

At this point I didn't know:

  • How to enable the profiler
  • How to install it
  • What type of report it produces
  • How to read the report

A bit of googling got me a few useful links on profiling.

I faced a bit of a problem installing line_profiler on my system (Win10), so I had to do a workaround; that's another topic, covered here.
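For anyone in the same spot, this is roughly how line_profiler is used: a minimal sketch with a toy function, not my notebook code (kernprof injects the @profile decorator when the script is run with -l):

```python
# profile_demo.py - a minimal line_profiler sketch (toy function, not the notebook code)
# Install : pip install line_profiler
# Run     : kernprof -l -v profile_demo.py
# Notebook: %load_ext line_profiler  then  %lprun -f weekly_trend_analysis weekly_trend_analysis(df)
import numpy as np

@profile                           # injected by kernprof when run with -l
def slow_and_fast():
    total = 0
    for x in range(100_000):       # the Python loop dominates the '% Time' column
        total += x * x
    arr = np.arange(100_000)
    fast = int((arr * arr).sum())  # the vectorised version barely registers
    return total, fast

if __name__ == '__main__':
    slow_and_fast()
```

The report it prints has one row per source line, with Hits, Time, Per Hit and % Time columns, which is exactly what answers the "which line is the bottleneck" question.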

Here is the output of the profiler,

Below are the sections of the code with high times,

So I started trying alternative approaches to get things to run faster. I had been reading articles saying "Numpy is fast", "vectors are fast", and started my tests.

Here are some of the attempts to solve the above performance problems (rough code sketches of these ideas appear right after the list),

  1. Getting the ticker code

    Had to settle with numpy here

  2. 3k to convert to numpy

    The pandas approach seems to be a bit expensive here, so I tried to convert it to numpy and ran into TypeError: invalid type promotion. This error came up because I was trying to concatenate arr_yearWeek (int64) with arr_ts (datetime64[ns]), like arr_concat = np.column_stack((arr_yearWeek, arr_ts)).

    To resolve this issue, I converted the timestamp ts to a string with df_temp['ts'] = df_temp['ts'].dt.strftime('%Y-%m-%d') and this worked (a sketch of this appears after the list).

    Here npi is import numpy_indexed as npi, another library. While testing I ran into various issues, and when I checked Stack Overflow for solutions I kept seeing the user Eelco Hoogendoorn suggest numpy_indexed in multiple posts, so I had to try it out; he is also the author of the library.

    So now most of my code uses this.

  3. Converting to a dataframe. It looks like creating a new dataframe is an expensive process. When I step back and think about the flow of the program, this function is called in a loop and a new dataframe is created for each company; if it takes this much time for a single company, obviously the whole program is going to be slow. So my thought was: what if I keep everything in numpy itself and see how it goes?

    There are two things to solve here:

    • The dataframe creation (line no. 165)
    • Ordering of the columns in the dataframe (line no. 173 - I did not think this step would consume this much time)

    I combined the above two problems into one solution. Since all the value arrays are going to be the same size, np.column_stack helped resolve the issue.

  4. Filling the same value into an entire dataframe column

    Resolved by filling the value in numpy instead.

  5. The last part; here there are multiple high timers:

    • 20k - Basically I am trying to define a dataframe with columns; that process seems to be very expensive here.
    • 6k - dataframe concatenation
    • 1.5k - using .isin()

    The thought process here is the same as before: why should I use a dataframe here, can I do everything in numpy? So I proceeded along that angle.

    Interesting concepts learned here are:

    1. Slicing in numpy
    2. numpy.isin and its invert option
    3. compress from itertools (from itertools import compress)
    4. vstack

    After making all the above changes in the new run, it took only 7min 47s.
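To make those changes concrete, here are a few rough sketches, starting with step 2: the invalid type promotion error and its workaround, plus a numpy_indexed aggregation. The column names, toy data and the aggregation are illustrative, not the notebook's exact code.

```python
import numpy as np
import numpy_indexed as npi
import pandas as pd

# Toy frame standing in for the real per-company data
df_temp = pd.DataFrame({
    'ts':       pd.to_datetime(['2020-01-06', '2020-01-07', '2020-01-13']),
    'yearWeek': [202002, 202002, 202003],
    'close':    [100.0, 102.5, 101.0],
})

arr_yearWeek = df_temp['yearWeek'].to_numpy()           # int64
# np.column_stack((arr_yearWeek, df_temp['ts'].to_numpy()))
#   -> TypeError: invalid type promotion (int64 vs datetime64[ns])

# Workaround: convert the timestamp column to string first
df_temp['ts'] = df_temp['ts'].dt.strftime('%Y-%m-%d')
arr_ts = df_temp['ts'].to_numpy()
arr_concat = np.column_stack((arr_yearWeek, arr_ts))    # both columns become strings

# numpy_indexed can then group/aggregate without going back to pandas,
# e.g. highest close per week (illustrative aggregation):
weeks, closeH = npi.group_by(arr_yearWeek).max(df_temp['close'].to_numpy())
```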
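For step 3, the idea is to keep each company's weekly results as one 2-D numpy array, already in the desired column order, instead of building and then reordering a new dataframe inside the loop. A sketch with made-up result arrays:

```python
import numpy as np

# Hypothetical per-week result arrays for one company (all the same length)
arr_yearWeek = np.array([202001, 202002, 202003])
arr_closeH   = np.array([105.0, 110.5, 98.2])
arr_closeL   = np.array([101.0, 103.0, 95.4])
arr_volAvg   = np.array([15000.0, 22000.0, 18000.0])

# Instead of pd.DataFrame({...}) followed by df = df[ordered_cols] inside the loop,
# stack the columns once, already in the order they are wanted:
weekly_np = np.column_stack((arr_yearWeek, arr_closeH, arr_closeL, arr_volAvg))

# A single dataframe can then be built once, outside the company loop:
# pd.DataFrame(weekly_np, columns=['yearWeek', 'closeH', 'closeL', 'volAvg'])
```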
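Step 4 then becomes trivial: a constant column is just np.full stacked onto the array (500325 is an arbitrary example scrip code, not from the notebook):

```python
import numpy as np

# Continuing the weekly_np idea above with a tiny two-column example
weekly_np = np.column_stack((np.array([202001, 202002, 202003]),
                             np.array([105.0, 110.5, 98.2])))

scrip_col = np.full(weekly_np.shape[0], 500325)      # the same value repeated for every week row
weekly_np = np.column_stack((weekly_np, scrip_col))  # numpy stand-in for df['code'] = 500325
```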
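And for step 5, the individual pieces mentioned there look roughly like this (again illustrative data, not the notebook's exact logic):

```python
import numpy as np
from itertools import compress

weeks      = np.array([202001, 202002, 202003, 202004])
closed_pos = np.array([True, False, True, True])     # e.g. "week closed positive"

# numpy slicing instead of dataframe indexing
recent = weeks[-2:]                                  # last two weeks

# np.isin, and its invert flag to get "not in" in a single call
mask_in  = np.isin(weeks, recent)
mask_out = np.isin(weeks, recent, invert=True)       # weeks NOT in `recent`

# itertools.compress filters any iterable by a boolean mask
positive_weeks = list(compress(weeks, closed_pos))

# np.vstack stacks row blocks, standing in for dataframe concatenation
block    = np.column_stack((weeks, closed_pos.astype(int)))
all_rows = np.vstack((block, block))
```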

Here is the new profiler report,

Needless to say, the hype is real. Numpy wins; it's fast.

Issues faced during the full execution using numpy,

  1. MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32. This issue occurred in two tests, the first time after processing 1600 companies and the second time after 900 companies. Since all elements of a numpy array have to be of the same type, and my data mixes integers, decimals and dates, everything gets stored as a generic object/string dtype, which takes far more memory.

    I tried a couple of approaches:

    • Tried to clear the memory used by variables
    • Tested whether deleting user-defined variables helps (it doesn't)
    • Thought the memory limit might have to do with the Python installation as well: my computer is 64-bit but the Python I had installed was 32-bit, so I reinstalled and checked; it didn't help either
    • Finally, what I did was write an intermediate result file after every 500 companies processed and then reinitialize the variables holding that data (a rough sketch of this is below)
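A rough sketch of that chunked-write pattern; process_company() here is just a stand-in for the real weekly_trend_analysis(), and the file names and toy data are illustrative. The point is only the flush-every-500-companies idea:

```python
import numpy as np
import pandas as pd

CHUNK = 500

def process_company(code):
    """Placeholder for weekly_trend_analysis(); returns a small 2-D result block."""
    return np.column_stack((np.full(3, code), np.random.rand(3)))

companies = range(1, 1201)          # placeholder list of company codes
results, part = [], 0

for i, code in enumerate(companies, start=1):
    results.append(process_company(code))
    if i % CHUNK == 0:              # write an intermediate file every 500 companies
        pd.DataFrame(np.vstack(results)).to_csv(f'weekly_part_{part}.csv', index=False)
        part += 1
        results = []                # reinitialise the accumulator to free memory

if results:                         # whatever is left after the last full chunk
    pd.DataFrame(np.vstack(results)).to_csv(f'weekly_part_{part}.csv', index=False)
```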

# Things to do next

Going forward, should I write only numpy programs?

No. It depends, to a certain extent. What am I doing here? I am calling weekly_trend_analysis() in a loop, passing the last one year of daily data for each company as the argument; the function processes it and returns one year of weekly data for that company, and this repeats for all companies.

  • Doing this via Pandas takes 1h 26min
  • Via Numpy takes 7min 47s

Instead of passing each company's daily data in a loop, if I pass the entire dataset with all the companies to the function via pandas, it takes only 1min 22s of wall time (a rough sketch of that shape of code is below).
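In other words, the fast pandas version groups the whole dataset by company and week in one pass instead of calling the function once per company. A minimal sketch, with assumed column names code, ts, close and vol:

```python
import pandas as pd

# Toy daily rows for *all* companies; 'code', 'ts', 'close', 'vol' are assumed names
df_all = pd.DataFrame({
    'code':  [500325, 500325, 532540, 532540],
    'ts':    pd.to_datetime(['2020-01-06', '2020-01-07', '2020-01-06', '2020-01-07']),
    'close': [100.0, 102.5, 2000.0, 1990.0],
    'vol':   [15000, 22000, 5000, 4500],
})

df_all = df_all.sort_values(['code', 'ts'])
df_all['yearWeek'] = df_all['ts'].dt.strftime('%Y-%W')

# One grouped pass over every company and week, instead of a per-company loop
weekly_all = df_all.groupby(['code', 'yearWeek']).agg(
    closeH=('close', 'max'),
    closeL=('close', 'min'),
    volAvg=('vol', 'mean'),
    first=('close', 'first'),
    last=('close', 'last'),
).reset_index()
```

The single groupby does in one vectorised pass what the per-company loop repeats thousands of times, which is where most of the difference between 1h 26min and 1min 22s comes from.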

So pandas is also good at processing large datasets. We have two enemies here, and they should be avoided at all costs,

  1. Looping
  2. The thought process that gets you to write a loop.

Couple of other points to remember,

  1. With complex code, maintenance will hit hard. Pandas is easier to understand than numpy.
  2. If you really want to scale, it is easy to convert from pandas to PySpark; I have done it. (Easy meaning you have to sit with it for a couple of weeks, but it is doable.)
  3. Numpy is very fast even with the looping itself; it got the elapsed time down from 1h 26min to about 7min. What I didn't like was the memory error, but the problem is with my data: I wanted multiple datatypes, so everything had to be changed to object types.

So finally, what's best? Both are, but it's the thought process that's going to save time.

# Related articles

  1. Running python line_profiler in cloud shell
  2. Github : BSE Trend analysis using Pandas(Notebook)
  3. Github : BSE Trend analysis using Numpy(Notebook)