# BSE Weekly trend analysis using Pandas & NumPy
It's been a month since I last worked in Python; the last time was during the Udacity Data Engineering Nanodegree, and over the month I was fully occupied with Rexx and Db2 REST.
In this exercise I am exploring all the possible things one can do with BSE daily data; for this analysis I will be using one year's worth of data. This is not technical analysis, just a little exercise to improve some Python skills (upskilling).
After giving it a lot of thought while looking at the data, I think I can derive the metrics below; this itself took a couple of days to think through :D
| Field | Description |
|---|---|
| closeH | Highest close in the week |
| closeL | Lowest close in the week |
| volHigh | Highest volume in the week |
| daysTraded | Number of days traded in the week |
| HSDL | Highest single-day loss |
| HSDG | Highest single-day gain |
| HSDLp | Highest single-day loss percent |
| HSDGp | Highest single-day gain percent |
| first | First close of the week |
| last | Last close of the week |
| wChngp | Week change percent |
| lastTrdDoW | Last traded day of the week |
| volAvgWOhv | Volume average without the high volume |
| HVdAV | High volume / average volume (without the high volume) |
| CPveoHVD | Close positive on the high-volume day |
| lastDVotWk | Last day volume of the week |
| lastDVdAV | Last day volume / average volume |
I was able to calculate and get data for all the above fields in a dataframe. The step-by-step approach is available in the notebook on GitHub.
To cut things short, the first run took 1h 26min and the second run took 1h 41min 41s via pandas to complete the analysis for one year of data. I didn't expect it to take that long to execute, and I didn't know where to start, so I posted my code on Stack Overflow asking for suggestions, and one user asked:
> Did you profile to see exactly which line(s) were the bottleneck(s)? – Mad Physicist
At that point I didn't know:
- How to enable a profiler
- How to install it
- What type of report it produces
- How to read the report
A bit of googling got me the links below:
- Tried - StackOverflow - How do I use line_profiler (from Robert Kern)?
- Tried - line-profiler-code-example
I faced a bit of a problem installing line_profiler on my system (Win10), so I had to do a workaround; that's another topic, here.
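For reference, the usual line_profiler workflow is to mark a function with `@profile` and run `kernprof -l -v script.py`. As a minimal, self-contained sketch of the same idea (which lines/functions eat the time), here is the standard library's `cProfile`, which needs no installation; the functions are made up for illustration:

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Deliberately quadratic, so it dominates the profile
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def fast_part(n):
    return sum(range(n))

def analysis():
    slow_part(300)
    fast_part(300)

profiler = cProfile.Profile()
profiler.enable()
analysis()
profiler.disable()

# Print the top entries sorted by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists each function with its call count and cumulative time; `slow_part` shows up at the top, which is exactly the kind of signal the profiler question was about.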
Here is the output of the profiler.
Below are the sections of the code that take the most time.
So I started trying alternative approaches to get things to run faster. I had been reading articles saying 'NumPy is fast' and 'vectors are fast', and started my tests.
Here are some of the attempts to solve the above performance problems.
Getting the ticker code
I had to settle with NumPy here; ~3k timer units to convert to NumPy.
The pandas approach seems to be a bit expensive here, so I tried to convert it to NumPy and faced a
`TypeError: invalid type promotion`. This error came because I was trying to concatenate arr_yearWeek (int64) with arr_ts (datetime64[ns]) like
`arr_concat = np.column_stack((arr_yearWeek, arr_ts))`
To resolve this issue, I converted the timestamp with
`df_temp['ts'] = df_temp['ts'].dt.strftime('%Y-%m-%d')` and this worked.
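A minimal reproduction of that promotion problem; the array names follow the post, but the data itself is made up:

```python
import numpy as np
import pandas as pd

df_temp = pd.DataFrame({
    "yearWeek": [202001, 202001, 202002],
    "ts": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-13"]),
})

arr_yearWeek = df_temp["yearWeek"].values  # dtype int64
arr_ts = df_temp["ts"].values              # dtype datetime64[ns]

# np.column_stack((arr_yearWeek, arr_ts)) fails here, because int64 and
# datetime64[ns] have no common dtype to promote to ("invalid type
# promotion" on the NumPy version used in the post).

# Fix from the post: format the timestamps as strings first
df_temp["ts"] = df_temp["ts"].dt.strftime("%Y-%m-%d")
arr_concat = np.column_stack((arr_yearWeek, df_temp["ts"].values))
print(arr_concat.shape)  # (3, 2)
```

The cost of the fix is that the stacked array is no longer numeric; every element is promoted to a string/object, which matters again later in the memory section.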
Here npi is
`import numpy_indexed as npi`, another library. When I was testing I encountered various issues, and when I checked Stack Overflow for solutions I saw user Eelco Hoogendoorn suggest
numpy_indexed in multiple posts, so I had to try it out; he is also the author of the library.
So now most of my code uses it.
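The core thing numpy_indexed provides here is grouping, e.g. `npi.group_by(keys).sum(values)`. To keep this snippet runnable without the extra dependency, here is the same grouping idea sketched with plain NumPy (the week/volume numbers are invented):

```python
import numpy as np

# With numpy_indexed the equivalent would be roughly:
#   import numpy_indexed as npi
#   keys, sums = npi.group_by(year_week).sum(volume)

year_week = np.array([202001, 202001, 202002, 202002, 202002])
volume = np.array([100, 150, 80, 120, 90])

# Map each row to its group index, then accumulate per group
keys, inverse = np.unique(year_week, return_inverse=True)
sums = np.zeros(len(keys), dtype=volume.dtype)
np.add.at(sums, inverse, volume)  # unbuffered add: safe with repeated indices

print(keys)  # [202001 202002]
print(sums)  # [250 290]
```

`np.unique(..., return_inverse=True)` plus `np.add.at` covers sums; numpy_indexed packages this pattern (and min/max/mean/first/last reductions) behind one call, which is why it fits the weekly aggregations here.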
Converting to a dataframe: it looks like creating a new dataframe is an expensive process. When I step back and think about the flow of the program, this function is called in a loop, and for each company a new dataframe is created; if it takes this much time for a single company, obviously the whole program is going to be slow. So my thought was: what if I keep everything in NumPy itself, and see how it goes?
There are two things to solve here:
- The dataframe (line no. 165)
- The ordering of columns in the dataframe (line no. 173; I did not think this step would consume so much time)
I combined the above two problems into one solution. Since all the values are going to be the same size,
`np.column_stack` helped resolve the issue.
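A sketch of that combined fix: stack the per-metric arrays once with `np.column_stack`, which fixes the column order at the same time, and build the dataframe only once at the end (the metric names follow the table above; the values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative weekly metrics, one value per week
closeH = np.array([105.0, 110.5])
closeL = np.array([98.0, 101.2])
volHigh = np.array([5000.0, 7200.0])

# One stacking call instead of creating a DataFrame and then reordering columns;
# the order of arguments here IS the column order
arr_week = np.column_stack((closeH, closeL, volHigh))

# Build the DataFrame once, with columns already in the desired order
df_week = pd.DataFrame(arr_week, columns=["closeH", "closeL", "volHigh"])
print(df_week.shape)  # (2, 3)
```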
Filling the same value into an entire dataframe column:
resolved by filling the value in NumPy.
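In NumPy that is either `np.full` or a broadcast assignment into a preallocated array; a tiny sketch (the ticker value is made up):

```python
import numpy as np

n_rows = 5
ticker_col = np.full(n_rows, "INFY")        # same value repeated down the column
flag_col = np.full(n_rows, 1, dtype=np.int64)

# Or broadcast into one column of a preallocated 2-D array
arr = np.empty((n_rows, 2), dtype=object)
arr[:, 0] = "INFY"  # scalar broadcasts over the whole column
arr[:, 1] = 1
print(ticker_col[0], arr[2, 0])  # INFY INFY
```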
In the last part, there are multiple high timers:
- 20k - defining a dataframe with columns; that process seems to be very expensive here
- 6k - dataframe concatenation
- 1.5k - using .isin()
The thought process here is the same as before: why should I use a dataframe here when I can do everything in NumPy? So I proceeded in that direction.
Interesting concepts learned here:
- Slicing in NumPy
- numpy.isin, and that invert is available
- compress from itertools:
`from itertools import compress`
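Small demonstrations of those three concepts, on made-up week data:

```python
import numpy as np
from itertools import compress

weeks = np.array([202001, 202002, 202003, 202004])
traded = np.array([202001, 202003])

# numpy.isin with invert=True gives "not in" directly, no separate negation
missing = weeks[np.isin(weeks, traded, invert=True)]
print(missing)  # [202002 202004]

# Slicing returns a view of the array, no copy made
print(weeks[1:3])  # [202002 202003]

# itertools.compress filters one sequence by a boolean mask
names = ["mon", "tue", "wed", "thu"]
mask = [True, False, True, False]
print(list(compress(names, mask)))  # ['mon', 'wed']
```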
After making all the above changes, the new run took only 7min 47s.
Here is the new profiler report.
Needless to say, the hype is real. NumPy wins; it's fast.
An issue faced during the full execution using NumPy:
`MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32`. This issue occurred in two tests, first after processing 1600 companies and the second time after 900. Since all elements of a NumPy array must be of the same type, and my data mixes integers, decimals, and dates, everything gets promoted to one wide common type (here <U32, 32-character strings), which is bigger in size.
I tried a couple of approaches:
- Tried to clear the memory of used variables
- Tested (it doesn't help) whether deleting user-defined variables helps
- Thought memory might have to do with the Python installation as well: my computer is 64-bit but the Python I had installed was 32-bit, so I reinstalled it and checked; it doesn't help
- Finally, what I did was write an intermediate result file after processing every 500 companies and then reinitialize the variables holding that data
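A sketch of that last workaround. `process_company()` is a hypothetical stand-in for the real weekly analysis, and the chunk size is 2 here instead of the 500 used in the post, just so the example stays tiny:

```python
import csv
import os
import tempfile

def process_company(name):
    # Hypothetical stand-in for weekly_trend_analysis():
    # returns two weekly rows per company
    return [[name, 202001, 1.0], [name, 202002, 2.0]]

companies = ["AAA", "BBB", "CCC", "DDD", "EEE"]
chunk_size = 2  # the post flushes every 500 companies
out_path = os.path.join(tempfile.mkdtemp(), "weekly_results.csv")

rows = []
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    for i, name in enumerate(companies, start=1):
        rows.extend(process_company(name))
        if i % chunk_size == 0 or i == len(companies):
            writer.writerows(rows)  # flush intermediate results to disk
            rows = []               # reinitialize so memory stays bounded

with open(out_path) as f:
    n_lines = sum(1 for _ in f)
print(n_lines)  # 10: two rows for each of the five companies
```

The peak memory held in `rows` is now one chunk's worth instead of the whole universe of companies, which is what avoids the `MemoryError`.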
# Things to do next
Going forward, should I write only NumPy programs?
No; it depends, to a certain extent. What am I doing here? I am calling weekly_trend_analysis() in a loop, passing the last one year of daily data for each company as an argument; the function processes it and returns one year of weekly data for that company, and this process repeats for all companies.
- Doing this via pandas takes 1h 26min
- Via NumPy it takes 7min 47s
Instead of passing each company's daily data in a loop, if I pass the entire dataset with all the companies to the function via pandas, it takes only 1min 22s (wall time).
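The loop-free shape of that call looks roughly like this: group the whole frame by company and week in one vectorized pass instead of slicing per company (column names are illustrative, not the notebook's):

```python
import pandas as pd

df = pd.DataFrame({
    "company":  ["A", "A", "B", "B"],
    "yearWeek": [202001, 202001, 202001, 202001],
    "close":    [10.0, 12.0, 20.0, 18.0],
    "volume":   [100, 300, 150, 250],
})

# One vectorized pass over ALL companies at once, no Python-level loop
weekly = df.groupby(["company", "yearWeek"]).agg(
    closeH=("close", "max"),
    closeL=("close", "min"),
    volHigh=("volume", "max"),
    daysTraded=("close", "size"),
).reset_index()

print(weekly.shape)  # (2, 6): one row per (company, week)
```

The per-company loop body becomes a set of aggregations over a grouped frame, which is where pandas gets its 1min 22s back.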
So pandas is good at processing large datasets as well. There is one enemy here that should be avoided at all cost:
- The thought process that gets you to write a loop.
A couple of other points to remember:
- With complex code, maintenance will hit hard. Pandas is easier to understand than NumPy.
- If you really want to scale, it is easy to convert from pandas to PySpark; I have done it. (Easy meaning you have to sit for a couple of weeks and do it, but it's doable.)
- NumPy is very fast even with the looping; it got the elapsed time down from 1h 26min to 7min. What I didn't like was the memory error, but the problem is with my data: I wanted multiple datatypes, so everything had to be changed to one common type.
So finally, what's best? Both are, but it's the thought process that's going to save time.