Predicting Hive - An Intro to Regression - 2

Alright!
So, we saw how to perform Linear Regression and Polynomial Regression (using a quadratic polynomial) in the first part of this two-part series!

NOTE: If you haven't seen the previous post, you can read it here: Predicting Hive - An Intro to Regression - 1.

NOTE: Obtaining a better prediction only means that we have a higher chance of being close to the actual value when the event actually happens.

1) Polynomial Regression (order 2) - Revisited

In order to use a quadratic equation, the matrix equation that needs to be solved is as follows (these are the normal equations, obtained by setting the derivative of the total squared error with respect to each coefficient to zero):

$$
\begin{bmatrix}
N & \sum x_i & \sum x_i^2 \\
\sum x_i & \sum x_i^2 & \sum x_i^3 \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{bmatrix}
$$

where $N$ is the number of data points and the fitted curve is $y = a_0 + a_1 x + a_2 x^2$.

The code that we wrote for this kind of regression was:

import os
import numpy as np
import matplotlib.pyplot as plt

# Function to sum the elements of an array
# (note: this name shadows Python's built-in sum(); kept to match the original post)
def sum(a):
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total

# Read all lines of the CSV at once. (The original loop-based read
# silently dropped the first line; if your file has a header row,
# skip it explicitly with fileData.readline() before readlines().)
fileData = open("<location>/data.csv", "r")
lines = fileData.readlines()
fileData.close()

print(*lines)

x = np.zeros(len(lines))
y = np.zeros(len(lines))

for i in range(len(lines)):
    x[i], y[i] = lines[i].split(",")

sumX = sum(x)
sumX2 = sum(pow(x,2))
sumX3 = sum(pow(x,3))
sumX4 = sum(pow(x,4))
sumY = sum(y)
sumXY = sum(x*y)
sumX2Y = sum(pow(x,2)*y)

print(x,y)
print("sumX, sumX2, sumX3, sumX4, sumY, sumXY, sumX2Y", sumX, sumX2, sumX3, sumX4, sumY, sumXY, sumX2Y)

n = 3 # number of unknown coefficients for a quadratic (order 2 + 1)

data_m = np.zeros((n,n+1))

#Explicitly Defining the Augmented Matrix
data_m[0,0] = len(x) # sum of x^0 = the number of data points, not n
data_m[0,1] = sumX
data_m[0,2] = sumX2
data_m[0,3] = sumY
data_m[1,0] = sumX
data_m[1,1] = sumX2
data_m[1,2] = sumX3
data_m[1,3] = sumXY
data_m[2,0] = sumX2
data_m[2,1] = sumX3
data_m[2,2] = sumX4
data_m[2,3] = sumX2Y

print("Initial augmented matrix = \n", data_m)

# Elimination
for j in range(1,n):
    #print("LOOP-j:", j)
    for k in range(j,n):
        #print("     LOOP-k:", k)
        factor = data_m[k,j-1] / data_m[j-1,j-1]
        for i in range(n+1):
            #print("         LOOP-i:", i, "| ", data_m[k,i])
            # (no rounding here: the original format(..., '7.2f') truncated
            # intermediate values to 2 decimals and degraded accuracy)
            data_m[k,i] = data_m[k,i] - factor*data_m[j-1,i]
            #print("-->",data_m[k,i])

print("Matrix after elimination = \n", data_m)

# Back Substitution
solution = np.zeros(n)

for j in range(n-1, -1, -1):
    subtractant = 0
    # only already-solved coefficients (i > j) contribute; the rest are still zero
    for i in range(j+1, n):
        subtractant = subtractant + solution[i] * data_m[j,i]
    solution[j] = (data_m[j,n] - subtractant)/data_m[j,j]
print("Solution matrix:\n", solution)

y2 = solution[0] + solution[1]*x + solution[2]*pow(x,2)

ax = plt.subplot()
ax.plot(28*x,y, "ob", linestyle="solid")
ax.plot(28*x, y2, "o", linestyle="solid", color="g")
ax.plot(350,(solution[0] + solution[1]*12.5 + solution[2]*pow(12.5,2)), 'ro')
plt.grid(True)
plt.title("Hive Price Chart")
ax.set_xlabel("Time (days)")
ax.set_ylabel("Price in USD")

plt.show()

...and the output we got looked something like this:

The green curve is the one we have fitted, the blue one is the actual data, and the red dot is our prediction for June 1.
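As a quick sanity check (not part of the original script), the coefficients from our hand-rolled solver can be compared against NumPy's built-in least-squares fit. A minimal sketch, assuming x, y and solution already exist from the code above:

import numpy as np

# np.polyfit returns coefficients from the highest power down to the
# constant term, so reverse them to compare with solution = [a0, a1, a2]
check = np.polyfit(x, y, 2)[::-1]
print("np.polyfit :", check)
print("our solver :", solution)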

Now, we'll try to improve our curve fitting by using Polynomial Regression with a higher-order polynomial, so that we can get a more accurate prediction.

2) Why do we need higher order polynomials?

Just ponder upon the following statements:

  • The unique curve that will always pass through two given points is a straight line (an order-1 polynomial).
  • The unique curve guaranteed to pass through any three given points is a quadratic polynomial.
  • ...and so on...
  • The unique curve through n given points is a polynomial of order (n-1).

That's the reason behind our craving for an n-th order poly.
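A tiny demonstration of that last bullet, using three made-up points (purely for illustration): a degree-2 polynomial fitted through them reproduces them exactly.

import numpy as np

# Three points determine a unique quadratic: the degree-2 fit
# passes through all three points, so the residuals are ~0
pts_x = np.array([0.0, 1.0, 2.0])
pts_y = np.array([1.0, 3.0, 2.0])
quad = np.polyfit(pts_x, pts_y, 2)
print(np.polyval(quad, pts_x)) # -> [1. 3. 2.]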

If we take the matrix equation for the previous quadratic case and carefully look at it, we'll find a pattern...
(Just have a look again!)

PATTERN:

  • In the first matrix, we can clearly see how the powers of $x_i$ keep increasing as we move down and to the right. This matrix is symmetric about its diagonal.
  • Next, in the matrix P (i.e. the right-most one), we again have powers of $x_i$ progressively increasing as we move down.

So, building upon the pattern, we can easily say that the matrix equation to be solved for an n-th order polynomial will be:

$$
\begin{bmatrix}
N & \sum x_i & \cdots & \sum x_i^n \\
\sum x_i & \sum x_i^2 & \cdots & \sum x_i^{n+1} \\
\vdots & \vdots & \ddots & \vdots \\
\sum x_i^n & \sum x_i^{n+1} & \cdots & \sum x_i^{2n}
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^n y_i \end{bmatrix}
$$

where $N$ is again the number of data points.

Now, building on that knowledge, we'll modify the code from the quadratic case and generalise it.

Full Code:

import os
import numpy as np
import matplotlib.pyplot as plt

# Function to sum the elements of an array
# (note: this name shadows Python's built-in sum(); kept to match the original post)
def sum(a):
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total

# Read all lines of the CSV at once. (The original loop-based read
# silently dropped the first line; if your file has a header row,
# skip it explicitly with fileData.readline() before readlines().)
fileData = open("<location>/data.csv", "r")
lines = fileData.readlines()
fileData.close()

print("Data received:\n", *lines)

x = np.zeros(len(lines))
y = np.zeros(len(lines))

for i in range(len(lines)):
    x[i], y[i] = lines[i].split(",")

print("x-matrix:\n",x,"\n y-matrix:\n",y)

# Defining the order of the polynomial to be used in the regression
order = input("Please enter the order of the polynomial you wish to use (or 'default'): ")

if order == "default":
    n = len(x)          # order = len(x) - 1, so the curve passes through every data point
else:
    n = int(order) + 1  # number of unknown coefficients = order + 1

print(n, type(n))
data_m = np.zeros((n,n+1))

# Generalising the augmented-matrix definition
# Defining the matrix A
for j in range(0,n): #Row counter
    for i in range(0,n): #Column counter
        if (i == 0 and j == 0):
            data_m[j,i] = len(x) # sum of x^0 = the number of data points
        else:
            data_m[j,i] = sum(pow(x,(i+j)))

# Defining the matrix B
for j in range(0,n):
    data_m[j,n] = sum(y*pow(x,j))

print("Initial augmented matrix = \n", data_m)

# Elimination
for j in range(1,n):
    #print("LOOP-j:", j)
    for k in range(j,n):
        #print("     LOOP-k:", k)
        factor = data_m[k,j-1] / data_m[j-1,j-1]
        for i in range(n+1):
            #print("         LOOP-i:", i, "| ", data_m[k,i])
            # (no rounding here: the original format(..., '7.2f') truncated
            # intermediate values to 2 decimals and degraded accuracy)
            data_m[k,i] = data_m[k,i] - factor*data_m[j-1,i]
            #print("-->",data_m[k,i])

print("Matrix after elimination = \n", data_m)

# Back Substitution
solution = np.zeros(n)

for j in range(n-1, -1, -1):
    subtractant = 0
    # only already-solved coefficients (i > j) contribute; the rest are still zero
    for i in range(j+1, n):
        subtractant = subtractant + solution[i] * data_m[j,i]
    solution[j] = (data_m[j,n] - subtractant)/data_m[j,j]
print("Solution matrix:\n", solution)

y2 = np.zeros(len(x))

for j in range(0,n):
    y2 = y2 + solution[j]*pow(x,j)

print(y2)

ax = plt.subplot()
ax.plot(28*x,y, "ob", linestyle="solid")
ax.plot(28*x, y2, "o", linestyle="solid", color="g")
plt.grid(True)
plt.title("Hive Price Chart")
ax.set_xlabel("Time (days)")
ax.set_ylabel("Price in USD")

plt.show()
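As an aside, the whole elimination + back-substitution stage can be cross-checked with NumPy's built-in linear solver. A minimal sketch, assuming data_m and n from the script above (row operations preserve the solution, so this works whether it is run before or after the elimination loop):

# Split the augmented matrix into A (n x n) and B (n,), then solve;
# the result should match our elimination + back substitution
A = data_m[:, :n]
B = data_m[:, n]
print("np.linalg.solve:", np.linalg.solve(A, B))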

Additional Features we have added:

  • The user can now choose the order of the polynomial to fit. Entering default tells the program to use one less than the number of data points as the order, so the fitted curve passes through every point.

One thing we haven't added yet is the prediction (extrapolation) part: we need to extrapolate the fitted curve to day = 350 to get the Hive price on June 1, 2020.

For this, we'll just make the following changes near the end of the code, just above plt.show():

.
.
.
for j in range(0,n):
    y2 = y2 + solution[j]*pow(x,j)

# Evaluate the fitted polynomial at any x (extrapolation included)
def predict(x):
    prediction = 0
    for j in range(0,n):
        prediction += solution[j]*pow(x,j)
    return prediction

print(y2)

ax = plt.subplot()
ax.plot(28*x,y, "ob", linestyle="solid")
ax.plot(28*x, y2, "ob", linestyle="solid", color="g")
ax.plot(350,predict(12.5), 'ro')
plt.grid(True)
plt.title("Hive Price Chart")
ax.set_xlabel("Time (days)")
ax.set_ylabel("Price in USD")
.
.
.

NOTE: As already mentioned in the previous post, we use 12.5 and not 350 for our prediction because our x-axis step size is 28 (12.5 × 28 = 350).
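In other words, any target day just needs to be divided by the step size before being fed to predict():

# Day 350 on the chart corresponds to x = 350 / 28 = 12.5
print("Predicted price on June 1:", predict(350 / 28))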

Now, our coding part is complete!...and we are ready to test and predict!!

| Order of Polynomial | Fit | Prediction |
| --- | --- | --- |
| 1 (linear) | (fit plot) | $0.043 (ERROR: one thing is very clear here: this fit shouldn't start from 0!) |
| 2 (quadratic) | (fit plot) | $0.214 |
| 3 (cubic) | (fit plot) | $0.58 |
| 4 | (fit plot) | $0.28 |
| 10 | (fit plot) | -$1.103 (the ERROR is very much clear here!!) |
| default (order = 12) | (fit plot) | -$0.55 (Haha!!) |
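(The prediction column can also be reproduced without re-running the interactive script for each order; a hypothetical sketch using np.polyfit on the same x and y arrays, rather than the post's own solver:)

import numpy as np

# Predictions at x = 12.5 (day 350) for several polynomial orders
# (NumPy may warn that the highest orders are poorly conditioned)
for m in (1, 2, 3, 4, 10, 12):
    c = np.polyfit(x, y, m)
    print(f"order {m:2d}: {np.polyval(c, 12.5):+.3f} $")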

OK, so we have seen above that polynomial interpolation seems to fit the data well. But so far we have only checked the fitted curve at intervals of x = 28; let's check it at a higher resolution so that we can see what is happening in between those points.

Plus, we already know one problem with this technique: the curves always pass close to 0 before they rise up to the desired level.

In the table above, we can see the curve clearly up to order 4, but for higher orders the curve is not clear because of the low resolution we used to plot it. So, let's use a higher resolution and see how the curve really behaves; this will also tell us why we are getting odd, erroneous predictions.
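Concretely, "higher resolution" just means evaluating the fitted polynomial on a dense grid instead of only at the data points. A minimal sketch, assuming solution, n, x and ax from the script above:

# Evaluate the fitted polynomial on a dense grid to expose the
# oscillations hiding between the original data points
x_dense = np.linspace(x.min(), x.max(), 500)
y_dense = np.zeros(len(x_dense))
for j in range(0, n):
    y_dense = y_dense + solution[j] * pow(x_dense, j)
ax.plot(28 * x_dense, y_dense, color="g")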

| Order of Polynomial | Fit | Comments |
| --- | --- | --- |
| 5 | (fit plot) | We can see the curve varies wildly between data points. |
| 8 | (fit plot) | |
| 10 | (fit plot) | This choice of order performs the worst (for our case). |
| 14 | (fit plot) | |

So, as you can see, using higher-order polynomials to fit the curve is not much of a boon: it gives the curve extra degrees of freedom, allowing it to vary wildly between and beyond the given data points.

CONCLUSION: Higher-order polynomials may be good for fitting the data, but they are definitely not a good idea for extrapolation and prediction.

IN THE NEXT ARTICLE:
We'll see how to find the optimum polynomial order for fitting our curve. Too low is bad because the curve doesn't fit the data properly and hence carries less information; too high is also bad because it lets the curve vary wildly. Optimisation is the key!!


Hope you learned something new from this article!
Thanks for your time.

Best,
M. Medro


Credits

All media used in this article have been created by me.


