Classification using TensorFlow with feature columns and dataset pipeline

Shravan C
4 min read · Feb 10, 2020

In this article, I will try to cover the important TensorFlow APIs. I recently completed a specialization on Coursera, Machine Learning with TensorFlow on Google Cloud Platform. I had to complete it in a hurry to save some money, but I remained curious about TensorFlow. The TensorFlow website itself has very good API documentation and tutorials. I referred to this article and recreated it for a different dataset, as suggested at the end of that article.

From the specialization above, I learned that the tf.data, tf.feature_column, and tf.estimator APIs are the core APIs for developing a machine learning model and deploying it to production. Deploying to production is not covered in this article. Here I want to focus on tf.data and tf.feature_column; tf.estimator is not considered, but it can be referred to in this link. I will cover tf.estimator in another article with a different dataset.

The dataset for this is taken from the UCI website. I have modified the last column to simplify it into a classification problem; the modified version is available at this link. The complete code is available on GitHub. It is also available on Colab. The program basically consists of four steps:

  • Load the data from CSV using pandas.
  • Create a pipeline with tf.data to shuffle and batch the input data.
  • Define feature_columns for the fields present in the dataframe.
  • Finally, build and train the model, then evaluate it with the test data.

tf.data

The tf.data API is used to create a pipeline for the input data. Think of it as encapsulating your input data in a tf.data.Dataset object, which in turn provides many different functions. Once the data is wrapped in tf.data, transformations can be chained together using the dot operations the tf.data class supports. One such pipeline could look like this:

import tensorflow as tf

# Chain transformations: shuffle the 100 ids, batch them in fives,
# and repeat the whole sequence five times.
dataset = (tf.data.Dataset.range(100)
           .shuffle(100, reshuffle_each_iteration=True)
           .batch(5)
           .repeat(5))
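
Nothing is computed until the pipeline is iterated. As a quick illustrative check (dataset here is the object built above), each element that comes out is a batch of five ids, reshuffled on every pass:

# 100 ids / 5 per batch x 5 repeats = 100 batches in total.
for batch in dataset.take(3):
    print(batch.numpy())  # e.g. [17 42  3 99 58]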

More on tf.data can be found in the guide and api_docs.

In this classification program, once the CSV is read with pandas, it is converted into a dataset with a pipeline similar to the one shown above. This covers the first two bullet points above.

URL = "./adult.data.csv"
df = pd.read_csv(URL)
train, test = train_test_split(df, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
# Converting Dataframes to Dataset.
def df_to_dataset(df, shuffle=True, batch_size=32):
df = df.copy()
labels = df.pop('good_salary')
ds = tf.data.Dataset.from_tensor_slices((dict(df) , labels))
if shuffle:
ds.shuffle(buffer_size=len(df))
ds = ds.batch(batch_size)
return ds
#Creating Dataset for training, validation and testing datatframes.
batch_size=32
train_ds = df_to_dataset(train, shuffle=True , batch_size=batch_size)
val_ds = df_to_dataset(val , shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test , shuffle=False, batch_size=batch_size)
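
Before defining the feature columns, it helps to pull a single batch out of the pipeline and inspect it. A minimal check (example_batch and label_batch are names introduced here just for illustration):

example_batch, label_batch = next(iter(train_ds))
print(list(example_batch.keys()))  # the feature names from the dataframe
print(example_batch['age'])        # a tensor of 32 ages
print(label_batch)                 # a tensor of 32 labels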

tf.feature_column

A feature column is again an encapsulation, this time of an individual column inside a dataframe. It takes each column in the dataframe, wraps it in a feature_column object, and applies a specified transformation. It does the input preprocessing job in the machine learning process: whenever new data is fed to the model, the preprocessing happens automatically before the data reaches the model, with no further manual work needed. Set the rules once, then provide the data to the model. TensorFlow provides many different feature column types; more can be found in this tutorial.
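
The demo(...) calls commented out in the code below assume a small helper, along the lines of the one in the TensorFlow tutorial this article follows, that applies a single feature column to the example batch pulled out earlier and prints the transformed values:

from tensorflow.keras import layers

# Apply one feature column to the example batch and show its output.
def demo(fc):
    feature_layer = layers.DenseFeatures(fc)
    print(feature_layer(example_batch).numpy())

With that helper in place, continuing with the code: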

from tensorflow import feature_column

feature_columns = []

# Numeric columns are passed through unchanged.
numeric_headers = ['age', 'education_num', 'fnlwgt', 'capital_gain',
                   'capital_loss', 'hours_per_week']
for header in numeric_headers:
    feature_columns.append(feature_column.numeric_column(header))

# Embedding columns for the high-cardinality categorical features.
country = feature_column.categorical_column_with_vocabulary_list(
    'native_country', df['native_country'].unique())
country_embedding = feature_column.embedding_column(country, dimension=40)
feature_columns.append(country_embedding)
# demo(country_embedding)

occupation = feature_column.categorical_column_with_vocabulary_list(
    'occupation', df['occupation'].unique())
occupation_embedding = feature_column.embedding_column(occupation, dimension=10)
feature_columns.append(occupation_embedding)
# demo(occupation_embedding)

education = feature_column.categorical_column_with_vocabulary_list(
    'education', df['education'].unique())
education_embedding = feature_column.embedding_column(education, dimension=20)
feature_columns.append(education_embedding)

# Indicator (one-hot) columns for the low-cardinality categorical features.
relationship = feature_column.categorical_column_with_vocabulary_list(
    'relationship', df['relationship'].unique())
relationship_one_hot = feature_column.indicator_column(relationship)
feature_columns.append(relationship_one_hot)

sex = feature_column.categorical_column_with_vocabulary_list(
    'sex', df['sex'].unique())
sex_one_hot = feature_column.indicator_column(sex)
feature_columns.append(sex_one_hot)

race = feature_column.categorical_column_with_vocabulary_list(
    'race', df['race'].unique())
race_one_hot = feature_column.indicator_column(race)
feature_columns.append(race_one_hot)

workclass = feature_column.categorical_column_with_vocabulary_list(
    'workclass', df['workclass'].unique())
workclass_one_hot = feature_column.indicator_column(workclass)
feature_columns.append(workclass_one_hot)

# Bucketized columns: numeric features split into ranges.
age = feature_column.numeric_column('age')
age_buckets = feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

hours_per_week = feature_column.numeric_column('hours_per_week')
hours_per_week_bucketize = feature_column.bucketized_column(
    hours_per_week, boundaries=[15, 20, 25, 30, 35, 40, 45, 50, 60])
feature_columns.append(hours_per_week_bucketize)

# Hashed bucket column: the category is hashed into a fixed number of
# buckets, so no vocabulary list is needed.
martial_status_hashed = feature_column.categorical_column_with_hash_bucket(
    'martial_status', hash_bucket_size=1000)
martial_status_one_hot = feature_column.indicator_column(martial_status_hashed)
feature_columns.append(martial_status_one_hot)

These are the feature columns I created for the dataset I used. This article gives a new example to work with beyond the ones in the TensorFlow documentation; one good way to learn is to replicate the same steps on a new dataset.
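
Using the hypothetical demo helper sketched earlier, you can inspect what any of these transformations produces. A bucketized column, for example, comes out as a one-hot vector over its ranges (output shown is illustrative):

demo(age_buckets)
# Each row is one example from the batch; the single 1 marks which of the
# eleven age ranges the example falls into, e.g.
# [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]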

Model

Last but not least, building the model. This article does not stress model evaluation; my intention is to stress feature_column, so the accuracy may not be high.

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds, validation_data=val_ds, epochs=10)

loss, accuracy = model.evaluate(test_ds)
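
model.evaluate returns the loss together with each metric declared in compile, so the test accuracy can be printed directly:

print("Test accuracy: {:.2%}".format(accuracy))  # about 81%, as noted below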

I was able to achieve 81% accuracy. I am now motivated to try more classification problems, to master feature_column, and to try crossed feature columns. tf.estimator is another API I am eager to explore. I am also happy that I was able to run the above program successfully on a local GPU; to run TensorFlow programs on a GPU, please follow my article. Enjoy coding…
