David Lee

NYC Taxi Fare Prediction using Linear Regression with BigQuery and PySpark

Data Science

Simple linear regression is hardly a state-of-the-art machine learning method for a regression problem. However, when the volume of data grows drastically, state-of-the-art algorithms may not be able to learn parameters quickly enough, even with the best computing resources. In this repo, I use simple linear regression to predict NYC taxi fares from data that has over 100 million records per year in a single SQL table. The data is stored in BigQuery public datasets, and PySpark is the tool I use to handle such big-data tasks.
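To illustrate why simple linear regression stays cheap as data grows, here is a minimal sketch of the closed-form least-squares fit in plain Python. It is not the PySpark pipeline from the repo: the function name `fit_simple_linear_regression` and the toy distance and fare values are hypothetical, chosen only for illustration.

```python
from typing import Sequence, Tuple


def fit_simple_linear_regression(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: trip distance (miles) vs. fare (dollars).
distances = [1.0, 2.0, 3.0, 4.0]
fares = [5.0, 7.5, 10.0, 12.5]
slope, intercept = fit_simple_linear_regression(distances, fares)
predicted_fare = slope * 5.0 + intercept  # fare estimate for a 5-mile trip
```

The fit only needs a handful of aggregate sums over the data, which is the kind of computation a distributed engine can spread across many workers.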

Spanish NER using Python and spaCy

Data Science

In this project, I want to correctly tag the Spanish words in a sentence with the appropriate tags: B (Beginning), I (Inside) and O (Outside). spaCy provides an NER class which I can train on labelled Spanish sentences. The trained NER model then generates a tag for each new word it sees.
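As a rough illustration of the B/I/O scheme, the sketch below turns entity spans into one tag per token. It does not use spaCy and is not the project's training code: the helper `to_bio_tags`, the entity labels `PER` and `LOC`, and the example sentence are assumptions made for the example.

```python
from typing import List, Tuple


def to_bio_tags(tokens: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans (end exclusive) into BIO tags."""
    tags = ["O"] * len(tokens)  # every token starts Outside any entity
    for start, end, label in entities:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span ({start}, {end}) for {len(tokens)} tokens")
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens inside the entity
    return tags


# Hypothetical example sentence with hand-labelled entity spans.
tokens = ["Gabriel", "García", "Márquez", "nació", "en", "Aracataca"]
entities = [(0, 3, "PER"), (5, 6, "LOC")]
bio_tags = to_bio_tags(tokens, entities)

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

Each token ends up with exactly one tag, so a multi-word name is marked by a B tag on its first word followed by I tags on the rest.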

Emoji Prediction with Neural Machine Translation

Data Science

This project follows Task 2 of SemEval 2018, in which workshop participants predict the emoji of a tweet from the tweeted text. Data is available for both English and Spanish tweets, and NMT is used here to translate between the two languages so as to increase the amount of training data.

How to install & setup Oracle DB within a Docker container on MacOS

Data Engineering

As part of BU CS779 Advanced Databases (Spring '19), the class required us to walk ourselves through the documentation on how to set up a SQL database. I'm a macOS user, and Oracle only provides installers for Windows and Linux. As such, I installed Oracle DB on Linux in a Docker container. As a result, Oracle DB ran much lighter on my machine than it would have with VirtualBox running in the background. I've documented the steps on how this can be done here.

Data of Everything

Projects

NYC Taxi Fare Prediction using Linear Regression with BigQuery and PySpark

Spanish NER using Python and spaCy

Emoji Prediction with Neural Machine Translation

How to install & setup Oracle DB within a Docker container on MacOS

About

A data scientist with over 4 years of experience in Data Science, Data Engineering and Web Development.

Loves corgis, coding, data and machine learning.

He recently graduated from Boston University with an M.S. in Computer Information Systems.

Currently working as a data scientist at NUHS.