
Tagged: machine learning


Concatenative Synthesis Experiments
We take an audio file and try to reconstruct it from frames of video using k-nearest neighbors to pick the best ones

#python #music #audio #machine learning
Vowpal Wabbit Experiments
Just some thoughts and notes while messing with Vowpal Wabbit, a fast machine learning tool

#machine learning #vw
Spotify Annoy (ANN) Experiments
Using Spotify's Approximate Nearest Neighbors library (Annoy) to mess with bitcoin market data

#machine learning




I don't think I'll post any code in-line here.  I might post the code on GitHub sometime.  I just decided to play around with trying something besides scikit-learn, which is what I used with the concatenative synthesis stuff.  I was also trying it with bitcoin market stuff too, I was just more paranoid back then... maybe I should be paranoid though... nah... *looks over shoulder*

Anyway, I got far enough to simulate trading (even made a few live trades, but I was impatient and set the tolerances  really tight so I could see trades execute... I didn't leave it on long, just to see if it "worked")

I was pretty impressed after twiddling my "knobs" and getting charts like that... but I wanted to load more data and scikit-learn was so slow; I think I maxed out my patience at around 5k samples.

With annoy I was able to generate a tree file on the server of 100K points, gzip, download and run magick on it.

I downloaded the following 20K points (after the 100K used to build the tree) to test.

I haven't made a trading system with it yet, but here are the charts of the actual price vs. the cumulative sum of the predicted price change at every 5 second tick.  Green is the actual price (average of best bid and best ask; I haven't been recording trade data, just depth), red is the 5 nearest neighbors averaged together.  There are 600 "features" in each vector.

Zoomed all the way out:

Zoomed into just the beginning:

Looks promising.
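
For anyone curious how charts like these can be put together, here is a rough sketch. It is a brute-force NumPy stand-in rather than Annoy, and it is not the code that produced the charts; the function names and the toy data are made up for illustration.

```python
import numpy as np

def knn_predict_change(train_X, train_y, query, k=5):
    """Average the labels of the k training vectors closest to `query`."""
    # Euclidean distance from the query to every training vector (brute force)
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return train_y[nearest].mean()

def cumulative_prediction(start_price, train_X, train_y, test_X, k=5):
    """Add up the predicted per-tick changes, starting from a known price."""
    changes = [knn_predict_change(train_X, train_y, q, k) for q in test_X]
    return start_price + np.cumsum(changes)

# Toy data only: 3 features per vector instead of 600
rng = np.random.RandomState(0)
train_X = rng.rand(200, 3)
train_y = rng.randn(200) * 0.01
test_X = rng.rand(10, 3)
predicted_curve = cumulative_prediction(100.0, train_X, train_y, test_X)
```

Plotting `predicted_curve` against the recorded mid price gives the red-vs-green comparison. Because the predictions are summed, small per-tick errors pile up over time.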

#machine learning #annoy #btc



I just saved a couple of charts; green is the predicted output and red is the actual values.  The predicted outputs seem clamped with these settings:

../vowpalwabbit/vw -d training.txt -k -c -f btce.model --loss_function squared -b 25 --passes 20 -q ee --l2 0.0000005

No decimation (downsampling) ~20K datapoints:

Downsampled with a factor of 8 (~2.5K datapoints):

../vowpalwabbit/vw -d training.txt -k -c -f btce.model --loss_function squared --passes 20 --l2 0.0000005

This model worked better, looking at it closely you can see:

And this is only working with about a fifth of the data collected so far.  Crazy that it actually seems to work sort of... in a muted sense.

Here's the graphing code for good measure:

#! /usr/bin/python2

import numpy as np
import matplotlib.pyplot as plt

from scipy import signal

actual_values = []
predicted_values = []

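with open('test.txt', 'r') as test_f:
    for line in test_f:
        # Assumed parsing: in VW format the label is the text before the first "|"
        actual_values.append(float(line.split('|')[0]))

with open('predictions.txt', 'r') as predictions_f:
    for line in predictions_f:
        # Assumed parsing: vw -p writes one prediction per line
        predicted_values.append(float(line.split()[0]))
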
# Decimate the charts
# actual_values = signal.decimate(actual_values, 10)
# predicted_values = signal.decimate(predicted_values, 10)

data_len = len(actual_values)
print data_len

x = np.arange(0, data_len)

plt.plot(x, actual_values, 'r-',  x, predicted_values, 'g-')
plt.show()

I'll probably work on working with gzipped datasets/vw cache files exclusively as I move forward.  I'm just worried about dealing with bigger datasets. Whee!

#python #machine learning #vw #matplotlib



LINECOUNT=$(wc -l < "$1")
# 20% of the lines go to test.txt (integer division)
TWENTYPCT=$(( LINECOUNT / 5 ))
shuf "$1" > tmp.txt

tail -n $(( LINECOUNT - TWENTYPCT )) tmp.txt > training.txt
head -n "$TWENTYPCT" tmp.txt > test.txt

rm tmp.txt

Wrote that script to take a raw VW formatted text file of data, scramble it and split it into 2 files: 80% into the training.txt file, and 20% into the test.txt file to test how well the regressor does after training.
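
The same shuffle-and-split idea can be sketched in Python for cases where the data is already in memory. This is only an illustration; `split_lines`, its parameters, and the sample lines are hypothetical names, not part of the script above.

```python
import random

def split_lines(lines, test_fraction=0.2, seed=None):
    """Shuffle `lines` and return (training, test) lists."""
    shuffled = list(lines)
    # A seeded Random instance makes the split repeatable
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled lines are the test set, the rest is training
    return shuffled[n_test:], shuffled[:n_test]

# Made-up VW-style lines, just to exercise the function
example = ["%d | 0:0.5" % i for i in range(10)]
training, test = split_lines(example, seed=1)
```

Note that shuffling time-ordered market data puts samples from the same stretch of time into both files, so the test score may look better than it would on truly unseen data.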



import gzip
from pymongo import MongoClient
from bson.objectid import ObjectId

client = MongoClient()
db = client.btce
mongo_log = db.depth_log

mongo_log.ensure_index([('time', 1)])

# Get features, prev_depth from 100,000 logs
data = mongo_log.find({}, fields={'features': True, 'prev_depth': True}).sort('time',  1).limit(100000)

f = gzip.open("vw_btce.gz", "w")

for log in data:
  prev_id = log['prev_depth']
  line = ""
  if prev_id:
    prev_log = mongo_log.find_one({ "_id": prev_id }, fields={'features': True})['features']

    # Output is the current price diff (feature index = 0)
    line += str(log['features'][0]) + " |"

    # Input features is previous depth log
    for i, feature in enumerate(prev_log):
        line += " " + str(i) + ":" + str(feature)

    line += "\n"
    f.write(line)

f.close()
print "Finished."

I was really happy to find a way to gzip the data from the db as I converted it to the VW format.  The script generated a 196MB gzipped file that I could download and gunzip on my local machine to 1.1GB!!  And that's only about a fifth of all the data that I've collected so far... I've been polling the btc-e API every 5 seconds since late October (on a VPS) and have been storing the orderbook depth (and pre-regularized feature set) in MongoDB... so I have just over half a million data points at the moment.
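
The convert-and-compress step boils down to something like the sketch below (Python 3 here, with the database left out). `vw_line` and `write_vw_gzip` are illustrative names, not functions from the script above.

```python
import gzip

def vw_line(label, features):
    """Format one example as 'label | 0:v0 1:v1 ...' (VW text format)."""
    pairs = " ".join("%d:%s" % (i, v) for i, v in enumerate(features))
    return "%s | %s\n" % (label, pairs)

def write_vw_gzip(path, examples):
    """Write (label, features) pairs straight into a gzipped text file."""
    # "wt" opens the gzip stream in text mode, so plain strings can be written
    with gzip.open(path, "wt") as out:
        for label, features in examples:
            out.write(vw_line(label, features))
```

Writing through `gzip.open` means the uncompressed text never has to exist on the server's disk.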

I did some experiments with the data using the K-NN method, and had some interesting results, but I found it to be pretty slow especially with 600 features per sample (bid prices, bid vol., ask prices, ask vol. ... all "regularized" or "normalized" to be between 0 and 1).
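
The post doesn't say exactly how the features were squeezed into the 0 to 1 range. Min-max scaling is one common way to do it, so the sketch below is an assumption about the general idea, not the actual preprocessing.

```python
def min_max_scale(values):
    """Rescale a list of numbers so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example with made-up bid prices
scaled = min_max_scale([10.0, 15.0, 20.0])
```

Scaling each snapshot on its own throws away the absolute price level and keeps only the shape of the book, which may or may not be what you want.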

Anyway I ended up finding out about VW this week and have been looking into it for a few days, and finally decided to buckle down and try it with the "trade my bitcoins/USD automatically and skim a profit" problem that I had pushed aside because it sorta got boring.  See, I just need to get something working so when I push it aside it'll be doing something lol.

So, after I got the data and split it into the training and test sets, I ran Vowpal Wabbit noobishly, following tutorials and the command line page... ah, and this Stack Overflow question was helpful too.

It ran without a hitch the first time, which was really neat. Eventually I tweaked the command line switches to generate a vw cache file and a vw model file...  I don't know much about either yet.

../vowpalwabbit/vw -c --passes 2 training.txt -f btce.model
final_regressor = btce.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = training.txt.cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.040602   0.040602            1         1.0   0.2015   0.0000      603
0.520301   1.000000            2         2.0  -0.0000   1.0000      603
0.260151   0.000000            4         4.0  -0.0000  -0.0000      603
0.182807   0.105463            8         8.0  -0.0000  -0.0000      603
0.096705   0.010604           16        16.0  -0.0000  -0.2575      603
0.239296   0.381886           32        32.0  -0.0680   0.2297      603
0.162444   0.085592           64        64.0   0.1570   0.0754      603
0.114285   0.066126          128       128.0   0.2820  -0.0029      603
0.156524   0.198762          256       256.0  -0.0000   0.0095      603
0.121862   0.087199          512       512.0   0.3805  -0.0610      603
0.127856   0.133851         1024      1024.0  -0.0000   0.0182      603
0.116276   0.104695         2048      2048.0  -0.0150   0.0023      603
0.111106   0.105936         4096      4096.0  -0.0000  -0.1003      603
0.110595   0.110084         8192      8192.0   0.0000  -0.1014      603
0.104141   0.097687        16384     16384.0  -1.0930  -0.0250      603
0.098368   0.092595        32768     32768.0   0.0000   0.1564      603
0.093588   0.088807        65536     65536.0   0.0010   0.0817      603
0.082555   0.082555       131072    131072.0   0.0005   0.0754      603 h

finished run
number of examples per pass = 72000
passes used = 2
weighted example sum = 144000
weighted label sum = -163.108
average loss = 0.0792736 h
best constant = -0.00113269
total feature number = 86829594

Looking at the "current label" and "current predict" columns we can see it sorta sucks at the moment, running with the defaults and 2 passes (which I really don't know what it does yet either... so much more reading awaits me!)... Current label is the actual price change over the next 5 seconds, and current predict is what the regressor predicts that change will be.  It doesn't have any concept of time, just how I "labeled" the data with the next datapoint's price difference... that's why the numbers are so low or zero a lot of the time, because in 5 seconds the price usually doesn't change that much... I'm hoping there's a pattern in the general shape of the orderbook that predicts a massive order going through that will throw the books out of whack for a few seconds, so the bot can throw in an order appropriately in the seconds before... if such a pattern even exists.  K-NN seemed to pick up on something, so there's hope!
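
The labeling scheme amounts to pairing each snapshot's features with the price difference recorded on the following snapshot. A minimal sketch, assuming the snapshots are already in time order as `(price_diff, features)` tuples (a made-up layout for illustration):

```python
def make_examples(snapshots):
    """Pair the features of snapshot t-1 with the price diff seen at snapshot t."""
    examples = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        label = cur[0]       # price change observed one tick later
        features = prev[1]   # orderbook features from the earlier tick
        examples.append((label, features))
    return examples

# Three toy snapshots produce two training examples
pairs = make_examples([(0.0, [1]), (0.5, [2]), (-0.1, [3])])
```

The first snapshot has no predecessor, so N snapshots yield N - 1 examples.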

Here we run the test.txt data through vw:

$ ../vowpalwabbit/vw -t -i btce.model -d test.txt -p predictions.txt
only testing
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = predictions.txt
using no cache
Reading datafile = test.txt
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.001363   0.001363            1         1.0  -0.0015  -0.0384      603
0.000841   0.000320            2         2.0  -0.0000   0.0179      603
0.001124   0.001406            4         4.0  -0.0000   0.0511      603
0.020274   0.039424            8         8.0  -0.0020   0.0668      603
0.026243   0.032213           16        16.0  -0.1590   0.0221      603
0.068017   0.109790           32        32.0   0.1825   0.0111      603
0.046948   0.025879           64        64.0  -0.0000  -0.0045      603
0.059675   0.072402          128       128.0  -0.0005   0.0276      603
0.079921   0.100167          256       256.0   0.4955   0.0770      603
0.092016   0.104112          512       512.0   0.0015   0.1556      603
0.080258   0.068500         1024      1024.0   0.0000   0.0325      603
0.076544   0.072830         2048      2048.0  -0.0000   0.0266      603
0.080918   0.085291         4096      4096.0   0.0000   0.0983      603
0.077156   0.073394         8192      8192.0   0.0000   0.0992      603
0.078745   0.080333        16384     16384.0   0.0045   0.0526      603

finished run
number of examples per pass = 19999
passes used = 1
weighted example sum = 19999
weighted label sum = 41.821
average loss = 0.0794741
best constant = 0.00209115
total feature number = 12059071

Not very pleased with the results still, especially this line:

average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features

0.079921   0.100167          256       256.0   0.4955   0.0770      603

The label was a relatively massive 50 cent difference in price in 5 seconds, but the prediction only said about 8 cents (0.0770)... and on other lines it was predicting nearly 10 cents (0.0983, 0.0992) when the actual change was 0.
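
One way to put numbers on "not very pleased" is to compare the model's squared error with the error from always predicting the mean label. The helpers below are a hypothetical sketch, not part of VW.

```python
def mean_squared_error(labels, predictions):
    """Average of (label - prediction)^2 over all examples."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

def constant_baseline_mse(labels):
    """Squared error you get by always predicting the mean label."""
    mean = sum(labels) / len(labels)
    return mean_squared_error(labels, [mean] * len(labels))

# The single example called out above: label 0.4955, prediction 0.0770
single_error = (0.4955 - 0.0770) ** 2   # about 0.175
```

If the model's error over test.txt is not clearly below `constant_baseline_mse` on the same labels, the model has learned little beyond "the price usually doesn't move".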

I definitely have to keep working at this, maybe reduce the depth log into 3 parts close, mid, far in relation to the last executed trade price, and then have the relative volumes/prices calculated from that.  It would actually probably be good to calculate other "indicators" used in trading and put those into the feature set.
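
A rough sketch of the close/mid/far idea is below. The band widths (0.5% and 2% away from the last trade price) are placeholder assumptions, and `bucket_depth` is a hypothetical helper name.

```python
def bucket_depth(orders, last_price, close_pct=0.005, mid_pct=0.02):
    """Sum order volume into close/mid/far bands by distance from last_price.

    orders: list of (price, volume) tuples from one side of the book.
    """
    buckets = {"close": 0.0, "mid": 0.0, "far": 0.0}
    for price, volume in orders:
        # Relative distance of this order from the last executed trade
        distance = abs(price - last_price) / last_price
        if distance <= close_pct:
            buckets["close"] += volume
        elif distance <= mid_pct:
            buckets["mid"] += volume
        else:
            buckets["far"] += volume
    return buckets

# Made-up asks around a last trade price of 100.0
asks = bucket_depth([(100.2, 1.0), (101.0, 2.0), (110.0, 3.0)], 100.0)
```

Run once for bids and once for asks, this would cut the 600 features down to a handful, at the cost of losing the fine shape of the book.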

Anyway, I thought I'd throw down some notes, I kept my work on this under wraps for a bit, but I figure this is pretty useless as it stands.

#bash #python #vw #machine learning



Using pylab to generate a spectrogram (example) for each frame of a video and for a same-length frame of audio, we can match frames of video that sound similar to the input audio (using nearest neighbors), approximating a re-creation of the original audio.

I used two separate scripts so I could try different audio with each video without having to re-generate a specgram for each frame of video each time; instead the specgrams are stored in a NumPy data file (create_clips.py).  Then I ran NN-search.py to search and render a final video using MoviePy.


from pylab import *
from moviepy.editor import *
import numpy as np

FILENAME = ".mp4"

oClip = VideoFileClip(FILENAME)
FRAME_DURATION = 1.0 / oClip.fps

# clip_ffts is an array filled with the specgram of each frame of audio
clip_ffts = [] 

# loop over each frame and calculate specgram (power of particular frequencies)
for i in np.arange(0, oClip.duration - FRAME_DURATION, FRAME_DURATION):
    clip = (oClip
            .subclip(i, i + FRAME_DURATION))

    # Stereo to mono by averaging both channels with np.mean
    test_clip = np.mean(clip.to_soundarray(fps=16000, nbytes=2), axis=1).flatten().astype(np.float64)

    # Calculate the specgram using pylab
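    Pxx, freqs, bins, im = specgram(test_clip, NFFT=512, Fs=16000, window=window_hanning, noverlap=440, detrend="mean")

    # Store the flattened specgram so NN-search.py can match against it
    clip_ffts.append(Pxx.flatten())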

# Convert python list to Numpy array
clip_ffts = np.array(clip_ffts)

# Save numpy array for future uses with NN-search.py
np.save(FILENAME + ".npy", clip_ffts)
print clip_ffts


from sklearn.neighbors import NearestNeighbors
import numpy as np
from pylab import specgram, cm, window_hanning
from moviepy.editor import *

# Video file we will use to try and approximate the audio file
FILENAME = ".mp4"

oClip = VideoFileClip(FILENAME)
FRAME_DURATION = 1.0/ oClip.fps

# The "target" audio file
tClip = AudioFileClip(".wav")

# We must first generate a Numpy file containing an array of specgram data from the video (create_clips.py), then load it
X = np.load(FILENAME + '.npy')

# Fitting the nearest neighbors model to the specgram data generated from create_clips.py
nbrs = NearestNeighbors(n_neighbors=1).fit(X)

# List containing moviepy clips of the nearest neighbor to the target audio frame
out_clips = []

# Loop over each frame in the target clip (tClip)
for i in np.arange(0, tClip.duration-FRAME_DURATION, FRAME_DURATION):
    test_clip = np.mean(tClip.subclip(i,i+FRAME_DURATION).to_soundarray(fps=16000, nbytes=2), axis=1).flatten().astype(np.float64)

    # Generate specgram from target audio
    Pxx, freqs, bins, im = specgram(test_clip, NFFT=512, Fs=16000, window=window_hanning, noverlap=440, detrend="mean", cmap=cm.gray)

    # Find nearest neighbor from frames of video
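    # kneighbors expects a 2-D array (one row per query), so reshape the flattened specgram
    distances, indices = nbrs.kneighbors(Pxx.flatten().reshape(1, -1))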
    print distances
    index = indices[0][0]
    print index

    # Push clip to be concatenated to list based on index and frame rate
    out_clips.append(oClip.subclip(index*FRAME_DURATION , (index*FRAME_DURATION)+FRAME_DURATION))

out_vid = concatenate(out_clips)
print "done!"

I'm no DSP expert, so the FFT settings used to generate the specgram were chosen by trial and error (with some Stack Exchange hints from other people's questions).

I might try and work on a script that can use multiple video files at some point.

I should also normalize the audio levels between video and audio inputs so the NN matching works better.
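
One simple way to do that would be to scale every audio frame to the same RMS level before computing its specgram. This is a sketch of that idea only; `normalize_rms` and the target level are assumptions, not something the scripts above do.

```python
import numpy as np

def normalize_rms(samples, target_rms=0.1):
    """Scale a mono sample array so its RMS level equals target_rms."""
    samples = np.asarray(samples, dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2))
    if rms == 0:
        # Pure silence: nothing to scale
        return samples
    return samples * (target_rms / rms)

# Toy frame: a square wave at full scale, brought down to RMS 0.5
frame = normalize_rms([1.0, -1.0, 1.0, -1.0], target_rms=0.5)
```

Applying the same function to `test_clip` in both scripts would make the nearest-neighbor distances depend on spectral shape rather than loudness, though near-silent frames would get boosted a lot.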

#python #machine learning #nearest neighbors #dsp