NYC Taxi Fare Prediction using Linear Regression with BigQuery and PySpark
Data Science
Simple linear regression is not a state-of-the-art machine learning method for regression problems.
However, when the volume of data grows drastically, more complex algorithms may not learn their parameters quickly enough, even with substantial computing resources.
As such, in this repo, I use simple linear regression to predict NYC taxi fares from data that has over 100 million records per year in a single SQL table.
The data is stored in BigQuery public datasets, and PySpark is the tool I use to handle a big-data task of this size.
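To illustrate why simple linear regression scales well, here is a minimal sketch in plain Python (not the repo's PySpark code). It fits a line in a single pass using only five running sums, which is the kind of aggregate a distributed engine can compute over very large tables. The function name, the choice of trip distance as the single feature, and the sample rows are assumptions made up for illustration.

```python
def fit_simple_linear_regression(pairs):
    """Fit y = slope * x + intercept in one pass over (x, y) pairs."""
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for x, y in pairs:
        # Only these running sums are kept, so memory use stays constant
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
    if n < 2:
        raise ValueError("need at least two points")
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical")
    # Closed-form least-squares solution for one feature
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Hypothetical (distance_miles, fare_usd) rows, not real taxi data
trips = [(1.0, 5.0), (2.0, 7.5), (3.0, 10.0), (4.0, 12.5)]
slope, intercept = fit_simple_linear_regression(trips)
```

Because the sums are additive, partial sums from separate chunks of data can be combined before the final two lines of arithmetic, which is what makes this approach a reasonable fit for very large tables.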
Spanish NER using Python and spaCy
Data Science
In this project, I want to correctly tag the Spanish words in a sentence with the appropriate tags: B (Beginning), I (Inside), or O (Outside).
spaCy provides an NER class which I can train on the labelled Spanish sentences. The trained NER model then generates a tag
for each new word it sees.
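As a small illustration of the B/I/O scheme, the sketch below assigns tags to a tokenized sentence given the token ranges of its entities. It is plain Python rather than spaCy's API, and the helper name `bio_tags`, the example sentence, and the entity ranges are all assumptions made up for illustration.

```python
def bio_tags(tokens, entities):
    """Return one B/I/O tag per token.

    entities is a list of (start, end) token index ranges, end exclusive.
    """
    tags = ["O"] * len(tokens)          # everything is Outside by default
    for start, end in entities:
        tags[start] = "B"               # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"               # remaining tokens of the entity
    return tags


tokens = ["Gabriel", "García", "Márquez", "nació", "en", "Aracataca"]
entities = [(0, 3), (5, 6)]             # hypothetical entity spans
tags = bio_tags(tokens, entities)
# tags is ["B", "I", "I", "O", "O", "B"]
```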
Emoji Prediction with Neural Machine Translation
Data Science
This project follows Task 2 of SemEval 2018, where workshop participants have to
predict the emoji of a tweet from its text. Data is available for both English and Spanish tweets, and NMT is used here to translate
between the two languages so as to increase the amount of data for training.
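The augmentation idea can be sketched as follows. This is a minimal illustration, not the project's pipeline: `augment_with_translation` and `toy_translate` are hypothetical names, the lookup table stands in for a real NMT model, and the sketch assumes a translated tweet keeps the emoji label of its source tweet.

```python
def augment_with_translation(examples, translate):
    """Return the original (text, emoji) pairs plus translated copies."""
    augmented = list(examples)
    for text, emoji_label in examples:
        # Assumption: the translated tweet keeps the same emoji label
        augmented.append((translate(text), emoji_label))
    return augmented


def toy_translate(text):
    """Stand-in for an NMT model: a tiny, made-up lookup table."""
    lookup = {"feliz cumpleaños": "happy birthday"}
    return lookup.get(text, text)


spanish_examples = [("feliz cumpleaños", "🎂")]  # made-up example
combined = augment_with_translation(spanish_examples, toy_translate)
```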
How to install & set up Oracle DB within a Docker container on macOS
Data Engineering
As part of BU CS779 Advanced Databases (Spring '19), the class required us to walk ourselves through
the documentation on how to set
up a SQL database. I'm a macOS user, and Oracle does not have an installer for macOS, only for Windows and Linux.
As such, I installed my Oracle DB on Linux inside a Docker container. As a result, my Oracle DB ran much lighter
on my machine than it would have with VirtualBox running in the background.
I've documented the steps on how this can be done here.