
bobby_dreamer

BSE Weekly trend analysis using Pandas & Numpy

python, jupyter, numpy · 4 min read

It's been a month since I last worked with Python; the last time was during the Udacity Data Engineering Nanodegree, and over the past month I was fully occupied with Rexx and Db2 REST.

In this exercise, I am exploring what one can do with BSE daily data; for this analysis I will be using 1 year's worth of data. This is not technical analysis, just a little exercise to improve some Python skills (upskilling).

Data available here

After giving the data a lot of thought, I think I can derive the metrics below; this alone took a couple of days to work out :D

| Field | Description |
| --- | --- |
| closeH | High close in the week |
| closeL | Low close in the week |
| volHigh | Highest volume in the week |
| volAvg | Volume average |
| daysTraded | Number of days traded in the week |
| HSDL | Highest Single Day Loss |
| HSDG | Highest Single Day Gain |
| HSDLp | Highest Single Day Loss percent |
| HSDGp | Highest Single Day Gain percent |
| first | First close of the week |
| last | Last close of the week |
| wChng | Week change |
| wChngp | Week change percent |
| lastTrdDoW | Last traded day of week |
| TI | Times increased |
| volAvgWOhv | Volume average without high volume |
| HVdAV | High volume / average volume (without high volume) |
| CPveoHVD | Close positive on high volume day |
| lastDVotWk | Last day volume of the week |
| lastDVdAV | Last day volume / average volume |

I was able to calculate all the above fields into a dataframe. The step-by-step approach is available in the notebook on GitHub.
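
To make the table concrete, here is a minimal sketch of how a few of these fields fall out of a pandas groupby (toy data; the column names are assumed to match the notebook):

```python
import pandas as pd

# Toy daily data for one ticker; columns follow the table above (assumed schema).
df_daily = pd.DataFrame({
    'yearWeek': [202001, 202001, 202001, 202002, 202002],
    'close':    [100.0, 103.0, 101.0, 104.0, 108.0],
    'volumes':  [500, 900, 400, 700, 650],
})

g = df_daily.groupby('yearWeek')
df_weekly = pd.DataFrame({
    'closeH':     g['close'].max(),       # high close in the week
    'closeL':     g['close'].min(),       # low close in the week
    'volHigh':    g['volumes'].max(),     # highest volume in the week
    'volAvg':     g['volumes'].mean(),    # average volume
    'daysTraded': g['close'].size(),      # number of trading days
    'first':      g['close'].first(),
    'last':       g['close'].last(),
})
df_weekly['wChng'] = df_weekly['last'] - df_weekly['first']
df_weekly['wChngp'] = (df_weekly['last'] / df_weekly['first']) * 100 - 100
print(df_weekly)
```

The other fields (HSDL, TI, the volume ratios) follow the same grouped pattern, just with more bookkeeping.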

To cut things short, the first run took 1h 26min and the second run took 1h 41min 41s via pandas to complete the analysis for 1 year of data. I didn't expect it to take that long, and I didn't know where to start either, so I posted my code on StackOverflow asking for suggestions, and one user asked:

Did you profile to see exactly which line(s) were the bottleneck(s)? – Mad Physicist

At this point I didn't know:

  • How to enable profiler
  • How to install it
  • What type of report does it produce
  • How to read the report

A bit of googling got me a few helpful links.

I faced a bit of a problem installing line_profiler on my system (Win10), so I had to do a workaround; that's another topic, covered here.

Here is the output of the profiler,

```
('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s

Total time: 0.043637 s
File: BTD-Analysis1V3.py
Function: weekly_trend_analysis at line 36

Line # Hits Time Per Hit % Time Line Contents
==============================================================
36 def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
37
38 1 3.0 3.0 0.0 if exchange == 'BSE':
39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
40 else:
41 ticker = df_daily.iloc[0]['symbol']
42
43 1 201.0 201.0 0.5 arr_yearWeek = df_daily['yearWeek'].to_numpy()
44 1 100.0 100.0 0.2 arr_close = df_daily['close'].to_numpy()
45 1 87.0 87.0 0.2 arr_prevclose = df_daily['prevclose'].to_numpy()
46 1 85.0 85.0 0.2 arr_chng = df_daily['chng'].to_numpy()
47 1 83.0 83.0 0.2 arr_chngp = df_daily['chngp'].to_numpy()
48 1 108.0 108.0 0.2 arr_ts = df_daily['ts'].to_numpy()
49 1 89.0 89.0 0.2 arr_volumes = df_daily['volumes'].to_numpy()
50
51 # Close
52 1 41.0 41.0 0.1 arr_concat = np.column_stack((arr_yearWeek, arr_close))
53 1 241.0 241.0 0.6 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
54
55 #a = df_temp[['yearWeek', 'close']].to_numpy()
56 1 113.0 113.0 0.3 yearWeek, daysTraded = np.unique(arr_concat[:,0], return_counts=True)
57
58 1 4.0 4.0 0.0 cmaxs, cmins = [], []
59 1 3.0 3.0 0.0 first, last, wChng, wChngp = [], [], [], []
60 2 11.0 5.5 0.0 for idx,subarr in enumerate(npi_gb):
61 1 32.0 32.0 0.1 cmaxs.append( np.amax(subarr) )
62 1 17.0 17.0 0.0 cmins.append( np.amin(subarr) )
63 1 2.0 2.0 0.0 first.append(subarr[0])
64 1 2.0 2.0 0.0 last.append(subarr[-1])
65 1 3.0 3.0 0.0 wChng.append( subarr[-1] - subarr[0] )
66 1 6.0 6.0 0.0 wChngp.append( ( (subarr[-1] / subarr[0]) * 100) - 100 )
67
68 #npi_gb.clear()
69 1 4.0 4.0 0.0 arr_concat = np.empty((100,100))
70
71 # Chng
72 1 21.0 21.0 0.0 arr_concat = np.column_stack((arr_yearWeek, arr_chng))
73 1 109.0 109.0 0.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
74
75 1 2.0 2.0 0.0 HSDL, HSDG = [], []
76 2 7.0 3.5 0.0 for idx,subarr in enumerate(npi_gb):
77 1 12.0 12.0 0.0 HSDL.append( np.amin(subarr) )
78 1 9.0 9.0 0.0 HSDG.append( np.amax(subarr) )
79
80 #npi_gb.clear()
81 1 3.0 3.0 0.0 arr_concat = np.empty((100,100))
82
83 # Chngp
84 1 15.0 15.0 0.0 arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
85 1 86.0 86.0 0.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
86
87 1 1.0 1.0 0.0 HSDLp, HSDGp = [], []
88 2 7.0 3.5 0.0 for idx,subarr in enumerate(npi_gb):
89 1 11.0 11.0 0.0 HSDLp.append( np.amin(subarr) )
90 1 9.0 9.0 0.0 HSDGp.append( np.amax(subarr) )
91
92 #npi_gb.clear()
93 1 3.0 3.0 0.0 arr_concat = np.empty((100,100))
94
95 # Last Traded Date of the Week
96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
98
99 1 2.0 2.0 0.0 lastTrdDoW = []
100 2 9.0 4.5 0.0 for idx,subarr in enumerate(j):
101 1 2.0 2.0 0.0 lastTrdDoW.append( subarr[-1] )
102
103 1 4.0 4.0 0.0 i = np.empty((100,100))
104 #j.clear()
105
106 # Times inreased
107 1 11.0 11.0 0.0 TI = np.where(arr_close > arr_prevclose, 1, 0)
108
109 # Below npi_gb_yearWeekTI is used in volumes section
110 1 19.0 19.0 0.0 arr_concat = np.column_stack((arr_yearWeek, TI))
111 1 111.0 111.0 0.3 npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
112
113 1 73.0 73.0 0.2 tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)
114
115 # Volume ( dependent on above section value t_group , thats the reason to move from top to here)
116 1 39.0 39.0 0.1 arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
117 1 94.0 94.0 0.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
118
119 1 2.0 2.0 0.0 vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
120 2 8.0 4.0 0.0 for idx,subarr in enumerate(npi_gb):
121 1 53.0 53.0 0.1 vavgs.append( np.mean(subarr) )
122 1 2.0 2.0 0.0 ldvotWk = subarr[-1]
123 1 2.0 2.0 0.0 lastDVotWk.append(ldvotWk)
124
125 #print(idx, 'O - ',subarr, np.argmax(subarr), ', average : ',np.mean(subarr))
126 1 13.0 13.0 0.0 ixDel = np.argmax(subarr)
127 1 2.0 2.0 0.0 hV = subarr[ixDel]
128 1 2.0 2.0 0.0 vmaxs.append( hV )
129
130 1 1.0 1.0 0.0 if(len(subarr)>1):
131 1 53.0 53.0 0.1 subarr = np.delete(subarr, ixDel)
132 1 29.0 29.0 0.1 vawoHV = np.mean(subarr)
133 else:
134 vawoHV = np.mean(subarr)
135 1 2.0 2.0 0.0 volAvgWOhv.append( vawoHV )
136 1 12.0 12.0 0.0 HVdAV.append(hV / vawoHV)
137 1 3.0 3.0 0.0 CPveoHVD.append( npi_gb_yearWeekTI[idx][ixDel] )
138 1 6.0 6.0 0.0 lastDVdAV.append(ldvotWk / vawoHV)
139
140 #npi_gb.clear()
141 1 3.0 3.0 0.0 arr_concat = np.empty((100,100))
142
143 # Preparing the dataframe
144 # yearWeek and occurances
145 #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
146 1 5.0 5.0 0.0 yearWeek = yearWeek.astype(int)
147 1 44.0 44.0 0.1 HSDL = np.round(HSDL,2)
148 1 21.0 21.0 0.0 HSDG = np.round(HSDG,2)
149 1 18.0 18.0 0.0 HSDLp = np.round(HSDLp,2)
150 1 18.0 18.0 0.0 HSDGp = np.round(HSDGp,2)
151
152 1 17.0 17.0 0.0 first = np.round(first,2)
153 1 17.0 17.0 0.0 last = np.round(last,2)
154 1 17.0 17.0 0.0 wChng = np.round(wChng,2)
155 1 16.0 16.0 0.0 wChngp = np.round(wChngp,2)
156
157 1 5.0 5.0 0.0 vavgs = np.array(vavgs).astype(int)
158 1 3.0 3.0 0.0 volAvgWOhv = np.array(volAvgWOhv).astype(int)
159 1 17.0 17.0 0.0 HVdAV = np.round(HVdAV,2)
160
161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)
166
167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
168
169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
172
173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
174
175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
178
179 # Removing all yearWeek in df_weekly2 from df_weekly
180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
182 1 5.0 5.0 0.0 c = list(a.difference(b))
183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
185
186 # Append the latest week data to df_weekly
187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
189
190 1 2.0 2.0 0.0 return df_weekly_all
```

Below are the sections of the code with high times,

```
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 1 3.0 3.0 0.0 if exchange == 'BSE':
39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
40 else:
41 ticker = df_daily.iloc[0]['symbol']

95 # Last Traded Date of the Week
96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
98
99 1 2.0 2.0 0.0 lastTrdDoW = []
100 2 9.0 4.5 0.0 for idx,subarr in enumerate(j):
101 1 2.0 2.0 0.0 lastTrdDoW.append( subarr[-1] )

161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)
166
167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
168
169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
172
173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
174
175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
178
179 # Removing all yearWeek in df_weekly2 from df_weekly
180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
182 1 5.0 5.0 0.0 c = list(a.difference(b))
183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
185
186 # Append the latest week data to df_weekly
187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
```

So I started trying alternative approaches to get things to run faster. I had been reading articles saying 'NumPy is fast', 'Vectors are fast', and started my tests.

Here are some of the attempts to solve the above performance problems,

  1. Getting the ticker code

    ```
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    38 1 3.0 3.0 0.0 if exchange == 'BSE':
    39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
    40 else:
    41 ticker = df_daily.iloc[0]['symbol']
    ```

    I had to settle on numpy here:

    ```python
    #ticker = df_temp.iloc[0]['sc_code']
    #ticker = df_daily.at[0,'sc_code']
    #ticker = df_daily.head(1)['sc_code'].values[0]
    ticker = df_daily['sc_code'].to_numpy()[0]
    ```
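
    If you want to check such micro-optimisations yourself, timeit makes the comparison concrete (a small sketch with a made-up dataframe; absolute timings vary by machine and pandas version):

    ```python
    import timeit
    import pandas as pd

    df_daily = pd.DataFrame({'sc_code': ['500325'] * 250})   # toy stand-in

    v_iloc  = df_daily.iloc[0]['sc_code']           # row-first: builds a whole row Series
    v_numpy = df_daily['sc_code'].to_numpy()[0]     # column-first: indexes a plain array

    t_iloc  = timeit.timeit(lambda: df_daily.iloc[0]['sc_code'], number=1000)
    t_numpy = timeit.timeit(lambda: df_daily['sc_code'].to_numpy()[0], number=1000)
    print(v_iloc == v_numpy, f"iloc={t_iloc:.4f}s to_numpy={t_numpy:.4f}s")
    ```

    Both give the same value; iloc[0] just has to materialise an entire row first.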
  2. 3k µs to convert to numpy

    ```
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    95 # Last Traded Date of the Week
    96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
    97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
    98
    99 1 2.0 2.0 0.0 lastTrdDoW = []
    100 2 9.0 4.5 0.0 for idx,subarr in enumerate(j):
    101 1 2.0 2.0 0.0 lastTrdDoW.append( subarr[-1] )
    ```

    The pandas approach seems to be a bit expensive here, so I tried converting it to NumPy and faced a TypeError: invalid type promotion. The error came up because I was trying to concatenate arr_yearWeek (int64) with arr_ts (datetime64[ns]), like arr_concat = np.column_stack((arr_yearWeek, arr_ts))

    To resolve this issue, I converted the timestamp ts to a string, df_temp['ts'] = df_temp['ts'].dt.strftime('%Y-%m-%d'), and this worked.

    ```python
    arr_concat = np.column_stack((arr_yearWeek, arr_ts))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

    lastTrdDoW = []
    for idx,subarr in enumerate(npi_gb):
        lastTrdDoW.append( subarr[-1] )
    ```

    Here npi is numpy_indexed (import numpy_indexed as npi), another library. While testing I had encountered various issues, and when I checked StackOverflow for solutions I kept seeing user Eelco Hoogendoorn suggest numpy_indexed in multiple posts, so I had to try it out; he is the author of the library as well.

    So now most of my code uses this.
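
    The promotion failure is easy to reproduce in isolation (a toy sketch; newer NumPy versions raise a DTypePromotionError, which is still a TypeError subclass):

    ```python
    import numpy as np
    import pandas as pd

    arr_yearWeek = np.array([202001, 202001, 202002])
    arr_ts = pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-09']).to_numpy()  # datetime64[ns]

    try:
        np.column_stack((arr_yearWeek, arr_ts))     # int64 + datetime64: no common dtype
    except TypeError as e:
        print('promotion failed:', e)

    # The workaround from above: turn the timestamps into strings first, so both
    # columns promote to one plain unicode array.
    arr_ts_str = pd.Series(arr_ts).dt.strftime('%Y-%m-%d').to_numpy()
    stacked = np.column_stack((arr_yearWeek, arr_ts_str))
    print(stacked.dtype, stacked.shape)
    ```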

  3. Converting to a dataframe. It looks like creating a new dataframe is an expensive process. When I step back and think about the flow of the program: this function is called in a loop, and for each company a new dataframe is created. If a single company takes this much time, obviously the whole program is going to be slow. So my thought was: what if I keep everything in NumPy itself, and see how it goes.

    There are two things to solve here

    • Dataframe (Line no.165)
    • Ordering of columns in the dataframe (Line no.173; I did not think this step would consume this much time)
    ```
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
    162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
    163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
    164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
    165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)

    169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
    170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
    171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
    172
    173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
    ```

    Combined the above two problems into one solution. Since all the values are going to be of the same size, np.column_stack helped resolve the issue:

    ```python
    np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
                    , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
                    , CPveoHVD, lastDVotWk, lastDVdAV))
    ```
  4. Filling the same value into an entire dataframe column

    ```
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
    ```

    Resolved by filling the value in numpy

    ```python
    ticker = np.full(yearWeek.shape[0], ticker)
    ```
  5. The last part; there are multiple high timers here

    • 20k - basically, defining an empty dataframe with columns; that step turns out to be very expensive here.
    • 6k - dataframe concatenation
    • 1.5k - using .isin()
    ```
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
    176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
    177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
    178
    179 # Removing all yearWeek in df_weekly2 from df_weekly
    180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
    181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
    182 1 5.0 5.0 0.0 c = list(a.difference(b))
    183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
    184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
    185
    186 # Append the latest week data to df_weekly
    187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
    188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
    ```

    The thought process here is the same as before: why use a dataframe at all when I can do everything in NumPy? So I proceeded along that angle.

    ```python
    # Removing latest yearWeek from df_weekly_all as it could be partial and concatenating latest one(df_weekly) to df_weekly_all
    if len(np_weekly_all) > 0:
        #print(len(np_weekly_all))
        a = np_weekly_all[:,1]
        b = np_weekly[:,1]
        tf_1 = np.isin(a, b, invert=True)
        #print(tf_1)
        t_result = list(compress(range(len(tf_1)), tf_1))
        #print(t_result)
        np_weekly_all = np_weekly_all[t_result]
        np_weekly_all = np.vstack((np_weekly_all, np_weekly))
    else:
        np_weekly_all = []
        np_weekly_all = np.vstack((np_weekly))
    ```

    Interesting concepts learned here:

    1. Slicing in NumPy
    2. numpy.isin and its invert option
    3. itertools.compress (from itertools import compress)
    4. numpy.vstack
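
    Put together, those four pieces make a compact keep-and-append pattern (a toy sketch; note that a boolean mask can also index directly, which makes itertools.compress optional):

    ```python
    import numpy as np

    # Column 1 holds yearWeek, as in np_weekly_all above (toy data).
    np_weekly_all = np.array([['A', '202001'], ['A', '202002']])
    np_weekly     = np.array([['A', '202002'], ['A', '202003']])   # 202002 recomputed

    keep = np.isin(np_weekly_all[:, 1], np_weekly[:, 1], invert=True)  # rows not being replaced
    np_weekly_all = np.vstack((np_weekly_all[keep], np_weekly))
    print(np_weekly_all)
    ```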

    After making all the above changes in the new run, it took only 7min 47s.

Here is the new profiler report,

```
('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s

Total time: 0.013077 s
File: BTD-Analysis1V3-lf.py
Function: weekly_trend_analysis_np at line 38

Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 def weekly_trend_analysis_np(exchange, np_weekly_all, df_daily):
39
40 1 4.0 4.0 0.0 if exchange == 'BSE':
41 #ticker = df_daily.at[0,'sc_code']
42 #ticker = df_daily.head(1)['sc_code'].values[0]
43 1 152.0 152.0 1.2 ticker = df_daily['sc_code'].to_numpy()[0]
44 else:
45 #ticker = df_daily.at[0,'symbol']
46 #ticker = df_daily.head(1)['symbol'].values[0]
47 ticker = df_daily['symbol'].to_numpy()[0]
48
49 1 34.0 34.0 0.3 arr_yearWeek = df_daily['yearWeek'].to_numpy()
50 1 34.0 34.0 0.3 arr_close = df_daily['close'].to_numpy()
51 1 31.0 31.0 0.2 arr_prevclose = df_daily['prevclose'].to_numpy()
52 1 29.0 29.0 0.2 arr_chng = df_daily['chng'].to_numpy()
53 1 27.0 27.0 0.2 arr_chngp = df_daily['chngp'].to_numpy()
54 1 28.0 28.0 0.2 arr_ts = df_daily['ts'].to_numpy()
55 1 27.0 27.0 0.2 arr_volumes = df_daily['volumes'].to_numpy()
56
57 # Close
58 1 72.0 72.0 0.6 arr_concat = np.column_stack((arr_yearWeek, arr_close))
59 1 651.0 651.0 5.0 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
60
61 #a = df_temp[['yearWeek', 'close']].to_numpy()
62 1 111.0 111.0 0.8 yearWeek, daysTraded = np.unique(arr_concat[:,0], return_counts=True)
63
64 1 2.0 2.0 0.0 cmaxs, cmins = [], []
65 1 1.0 1.0 0.0 first, last, wChng, wChngp = [], [], [], []
66 54 79.0 1.5 0.6 for idx,subarr in enumerate(npi_gb):
67 53 465.0 8.8 3.6 cmaxs.append( np.amax(subarr) )
68 53 423.0 8.0 3.2 cmins.append( np.amin(subarr) )
69 53 86.0 1.6 0.7 first.append(subarr[0])
70 53 75.0 1.4 0.6 last.append(subarr[-1])
71 53 103.0 1.9 0.8 wChng.append( subarr[-1] - subarr[0] )
72 53 142.0 2.7 1.1 wChngp.append( ( (subarr[-1] / subarr[0]) * 100) - 100 )
73
74 #npi_gb.clear()
75 1 6.0 6.0 0.0 arr_concat = np.empty((100,100))
76
77 # Chng
78 1 24.0 24.0 0.2 arr_concat = np.column_stack((arr_yearWeek, arr_chng))
79 1 357.0 357.0 2.7 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
80
81 1 2.0 2.0 0.0 HSDL, HSDG = [], []
82 54 75.0 1.4 0.6 for idx,subarr in enumerate(npi_gb):
83 53 387.0 7.3 3.0 HSDL.append( np.amin(subarr) )
84 53 402.0 7.6 3.1 HSDG.append( np.amax(subarr) )
85
86 #npi_gb.clear()
87 1 4.0 4.0 0.0 arr_concat = np.empty((100,100))
88
89 # Chngp
90 1 21.0 21.0 0.2 arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
91 1 292.0 292.0 2.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
92
93 1 2.0 2.0 0.0 HSDLp, HSDGp = [], []
94 54 75.0 1.4 0.6 for idx,subarr in enumerate(npi_gb):
95 53 382.0 7.2 2.9 HSDLp.append( np.amin(subarr) )
96 53 403.0 7.6 3.1 HSDGp.append( np.amax(subarr) )
97
98 #npi_gb.clear()
99 1 6.0 6.0 0.0 arr_concat = np.empty((100,100))
100
101 # Last Traded Date of the Week
102 1 33.0 33.0 0.3 arr_concat = np.column_stack((arr_yearWeek, arr_ts))
103 1 341.0 341.0 2.6 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
104
105 1 1.0 1.0 0.0 lastTrdDoW = []
106 54 70.0 1.3 0.5 for idx,subarr in enumerate(npi_gb):
107 53 79.0 1.5 0.6 lastTrdDoW.append( subarr[-1] )
108
109 #npi_gb.clear()
110 1 5.0 5.0 0.0 arr_concat = np.empty((100,100))
111
112 # Times inreased
113 1 14.0 14.0 0.1 TI = np.where(arr_close > arr_prevclose, 1, 0)
114
115 # Below npi_gb_yearWeekTI is used in volumes section
116 1 16.0 16.0 0.1 arr_concat = np.column_stack((arr_yearWeek, TI))
117 1 267.0 267.0 2.0 npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
118
119 1 78.0 78.0 0.6 tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)
120
121 # Volume ( dependent on above section value t_group , thats the reason to move from top to here)
122 1 16.0 16.0 0.1 arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
123 1 277.0 277.0 2.1 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
124
125 1 3.0 3.0 0.0 vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
126 54 87.0 1.6 0.7 for idx,subarr in enumerate(npi_gb):
127 53 1077.0 20.3 8.2 vavgs.append( np.mean(subarr) )
128 53 82.0 1.5 0.6 ldvotWk = subarr[-1]
129 53 83.0 1.6 0.6 lastDVotWk.append(ldvotWk)
130
131 #print(idx, 'O - ',subarr, np.argmax(subarr), ', average : ',np.mean(subarr))
132 53 235.0 4.4 1.8 ixDel = np.argmax(subarr)
133 53 88.0 1.7 0.7 hV = subarr[ixDel]
134 53 73.0 1.4 0.6 vmaxs.append( hV )
135
136 53 71.0 1.3 0.5 if(len(subarr)>1):
137 53 1610.0 30.4 12.3 subarr = np.delete(subarr, ixDel)
138 53 1024.0 19.3 7.8 vawoHV = np.mean(subarr)
139 else:
140 vawoHV = np.mean(subarr)
141 53 85.0 1.6 0.6 volAvgWOhv.append( vawoHV )
142 53 327.0 6.2 2.5 HVdAV.append(hV / vawoHV)
143 53 102.0 1.9 0.8 CPveoHVD.append( npi_gb_yearWeekTI[idx][ixDel] )
144 53 791.0 14.9 6.0 lastDVdAV.append( np.round(ldvotWk / vawoHV, 2) )
145
146 #npi_gb.clear()
147 1 4.0 4.0 0.0 arr_concat = np.empty((100,100))
148
149 # Preparing the dataframe
150 # yearWeek and occurances
151 #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
152 1 5.0 5.0 0.0 yearWeek = yearWeek.astype(int)
153 1 59.0 59.0 0.5 HSDL = np.round(HSDL,2)
154 1 26.0 26.0 0.2 HSDG = np.round(HSDG,2)
155 1 23.0 23.0 0.2 HSDLp = np.round(HSDLp,2)
156 1 23.0 23.0 0.2 HSDGp = np.round(HSDGp,2)
157
158 1 23.0 23.0 0.2 first = np.round(first,2)
159 1 23.0 23.0 0.2 last = np.round(last,2)
160 1 23.0 23.0 0.2 wChng = np.round(wChng,2)
161 1 23.0 23.0 0.2 wChngp = np.round(wChngp,2)
162
163 1 12.0 12.0 0.1 vavgs = np.array(vavgs).astype(int)
164 1 16.0 16.0 0.1 volAvgWOhv = np.array(volAvgWOhv).astype(int)
165 1 24.0 24.0 0.2 HVdAV = np.round(HVdAV,2)
166
167 1 16.0 16.0 0.1 ticker = np.full(yearWeek.shape[0], ticker)
168 1 2.0 2.0 0.0 np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
169 1 2.0 2.0 0.0 , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
170 1 546.0 546.0 4.2 , CPveoHVD, lastDVotWk, lastDVdAV))
171
172 # Removing latest yearWeek from df_weekly_all as it could be partial and concatenating latest one(df_weekly) to df_weekly_all
173 1 2.0 2.0 0.0 if len(np_weekly_all) > 0:
174 #print(len(np_weekly_all))
175 1 2.0 2.0 0.0 a = np_weekly_all[:,1]
176 1 1.0 1.0 0.0 b = np_weekly[:,1]
177 1 205.0 205.0 1.6 tf_1 = np.isin(a, b, invert=True)
178 #print(tf_1)
179 1 13.0 13.0 0.1 t_result = list(compress(range(len(tf_1)), tf_1))
180 #print(t_result)
181 1 13.0 13.0 0.1 np_weekly_all = np_weekly_all[t_result]
182 1 40.0 40.0 0.3 np_weekly_all = np.vstack((np_weekly_all, np_weekly))
183 else:
184 np_weekly_all = []
185 np_weekly_all = np.vstack((np_weekly))
186
187 1 2.0 2.0 0.0 return np_weekly_all
```

Needless to say, the hype is real. NumPy wins; it's fast.

Issues faced during the full execution using NumPy:

  1. MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32. This issue occurred in two tests: the first after processing 1600 companies, and the second after 900 companies. Since all elements of a NumPy array must share one dtype, and my data mixes integers, decimals and dates, everything gets promoted to a wide common type (the <U32 in the error), which is much bigger in size.

    I tried a couple of approaches:

    • Tried to clear the memory of used variables, like:

      ```python
      npi_gb = None
      arr_concat = None
      ```
    • Tested (doesn't help) whether deleting user-defined variables helps:

      ```python
      for name in dir():
          if not name.startswith('_'):
              print(name)
              del globals()[name]
      ```
    • Thought the memory issue might have to do with the Python installation as well. My computer is 64-bit but the Python I had installed was 32-bit, so I reinstalled and checked it. It doesn't help:

      ```python
      # To know which bitness of Python you are running
      import struct
      print(struct.calcsize("P") * 8)
      ```
    • Finally, what I did was write an intermediate result file after processing every 500 companies and then reinitialize the variables holding that data.
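
      That flush-and-reset loop looks roughly like this (a sketch with made-up names and a tiny chunk size; the real run used 500):

      ```python
      import os
      import tempfile
      import numpy as np

      FLUSH_EVERY = 2                                   # 500 in the real run
      out_path = os.path.join(tempfile.gettempdir(), 'weekly_partial.csv')
      open(out_path, 'w').close()                       # start fresh

      np_weekly_all = np.empty((0, 2), dtype=object)
      for i, company in enumerate(['A', 'B', 'C', 'D', 'E'], start=1):
          np_weekly = np.array([[company, '202001']], dtype=object)  # stand-in result
          np_weekly_all = np.vstack((np_weekly_all, np_weekly))
          if i % FLUSH_EVERY == 0:
              with open(out_path, 'a') as f:            # write intermediate results
                  np.savetxt(f, np_weekly_all, fmt='%s', delimiter=',')
              np_weekly_all = np.empty((0, 2), dtype=object)  # reinitialize -> frees memory

      with open(out_path, 'a') as f:                    # flush whatever is left
          np.savetxt(f, np_weekly_all, fmt='%s', delimiter=',')
      ```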

# Things to do next

Going forward, should I write only NumPy programs?

No. It depends, to a certain extent. What am I doing here? I am calling weekly_trend_analysis() in a loop, passing the last 1 year of daily data for each company as an argument; the function processes it and returns 1 year of weekly data for that company, and the process repeats for all companies.

  • Doing this via Pandas takes 1h 26min
  • Via Numpy takes 7min 47s

Instead of passing each company's daily data in a loop, if I pass the entire dataset with all the companies to the function via pandas, it takes only Wall time: 1min 22s.
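
The loop-free version is essentially a single grouped aggregation keyed by company and week; here is a sketch with the assumed column names (the real notebook computes many more fields):

```python
import pandas as pd

df_daily_all = pd.DataFrame({        # toy data for two companies
    'sc_code':  ['A', 'A', 'A', 'B', 'B'],
    'yearWeek': [202001, 202001, 202002, 202001, 202001],
    'close':    [10.0, 11.0, 12.0, 50.0, 49.0],
    'volumes':  [100, 300, 200, 400, 500],
})

# One groupby over the full dataset replaces the per-company loop entirely:
df_weekly_all = (df_daily_all
                 .groupby(['sc_code', 'yearWeek'])
                 .agg(closeH=('close', 'max'),
                      closeL=('close', 'min'),
                      volAvg=('volumes', 'mean'),
                      first=('close', 'first'),
                      last=('close', 'last'),
                      daysTraded=('close', 'size'))
                 .reset_index())
df_weekly_all['wChng'] = df_weekly_all['last'] - df_weekly_all['first']
print(df_weekly_all)
```

The only real change from the single-company version is the composite group key; pandas does the "loop" internally in compiled code.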

So pandas is good at processing large datasets as well. We have two enemies here, and they should be avoided at all costs:

  1. Looping
  2. Thought process that gets you to write a loop.

A couple of other points to remember:

  1. With complex code, maintenance will hit hard. Pandas is easier to understand than NumPy.
  2. If you really want to scale, it is easy to convert from pandas to PySpark; I have done it. (Easy meaning you have to sit down for a couple of weeks and do it, but it's doable.)
  3. NumPy is very fast even with looping; it got the elapsed time down from 1h 26min to under 8 minutes. What I didn't like is the memory error, but that problem is with my data: I wanted multiple datatypes, so everything had to be promoted to one wide type.

So finally, what's best? Both are, but it's the thought process that's going to save time.

# Related articles

  1. Running python line_profiler in cloud shell
  2. Github : BSE Trend analysis using Pandas(Notebook)
  3. Github : BSE Trend analysis using Numpy(Notebook)