Running demo colabs and watching online courses are fine, but the only way for me to really understand the details is by doing it myself.
Loading the dataset
Let's start by loading the dataset. First I looked at the docs at https://www.tensorflow.org/datasets/catalog/imdb_reviews. Surprisingly, the page shows no sample code for loading it. It contains a link to the original page where we can download the data as a file, but I'd rather use the Datasets API. It also links to the source code of tfds.text.imdb.IMDBReviews, but that code doesn't show how to load the data either.
Another Google search reveals that another package also exposes the IMDB dataset: tf.keras.datasets.imdb. So which one should I use? I guess tfds is supposed to be more friendly API-wise?
I wish the loading snippet was also included at https://www.tensorflow.org/datasets/catalog/imdb_reviews.
So the code is:
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True, shuffle_files=True)
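As a minimal sketch, the dictionary returned by tfds.load can be split into the train and test sets used in the rest of this post. The helper name load_imdb_subwords is my own, not part of TFDS. As far as I know the subwords8k config was deprecated in later TFDS releases, so this may only work with an older version.

```python
def load_imdb_subwords():
    """Load IMDB reviews pre-encoded as subword ids (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require TFDS.
    import tensorflow_datasets as tfds

    dataset, info = tfds.load(
        'imdb_reviews/subwords8k',
        with_info=True,       # also return the DatasetInfo object
        as_supervised=True,   # yield (input, label) tuples instead of dicts
        shuffle_files=True,
    )
    # With no split argument, tfds.load returns a dict keyed by split name.
    return dataset['train'], dataset['test'], info


# Usage (downloads the dataset on the first call):
# train_data, test_data, info = load_imdb_subwords()
```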
Exploring the dataset
Let's look at train_data.
<_OptionsDataset shapes: ((None,), ()), types: (tf.int64, tf.int64)>
It's a dataset, so we can take the first example and print it out:
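Here is a sketch of that step. To keep it runnable without downloading anything, it uses a tiny stand-in dataset with the same structure as train_data (a variable-length int64 vector plus a scalar int64 label). The token ids and labels are made up for illustration; the same loop should work on the real train_data.

```python
import tensorflow as tf

# Stand-in for train_data: two tiny "reviews" of different lengths.
# These ids and labels are invented, not taken from the IMDB dataset.
toy_examples = [([12, 7, 3], 1), ([5, 9], 0)]

toy_data = tf.data.Dataset.from_generator(
    lambda: iter(toy_examples),
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int64),  # token ids
        tf.TensorSpec(shape=(), dtype=tf.int64),       # label
    ),
)

# take(1) yields a dataset holding only the first element.
for tokens, label in toy_data.take(1):
    print(tokens)
    print(label)
    first_tokens, first_label = tokens, label
```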
So each item in train_data is a tuple of (input Tensor, output Tensor). The input is a 1D Tensor and contains 222 tokens in this example (subword ids rather than whole words, since this is the subwords8k config). To verify this I ran print(tf.shape(input)) and got tf.Tensor([222], shape=(1,), dtype=int32).
The output is a 0d Tensor that contains one value: whether the review was positive (value 1) or negative (value 0).
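The difference between the two shapes can be checked on hand-made tensors. This is only an illustration with invented values: tf.shape returns the shape as a 1D int32 Tensor, while the .shape attribute gives the static TensorShape.

```python
import tensorflow as tf

# Invented values standing in for one (input, output) pair.
tokens = tf.constant([12, 7, 3], dtype=tf.int64)
label = tf.constant(1, dtype=tf.int64)

dynamic_shape = tf.shape(tokens)  # 1D int32 Tensor holding the length
static_shape = tokens.shape       # TensorShape known without running ops

print(dynamic_shape)
print(static_shape)
print(label.shape)                # empty shape: a 0d (scalar) Tensor
```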
It'd be nice to read the review as text though.
SubwordTextEncoder sounds like what I am looking for. The doc at https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder shows how to use it. It exposes a decode method that "Decodes a list of integers into text."
Here's how to get to the Text instance:
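A minimal sketch, assuming info is the DatasetInfo object returned by tfds.load with with_info=True. The helper name get_text_feature is hypothetical.

```python
def get_text_feature(info):
    """Return the Text feature connector from a DatasetInfo (hypothetical helper)."""
    # info.features behaves like a dict keyed by feature name;
    # for imdb_reviews the keys are 'text' and 'label'.
    return info.features['text']


# Usage, given `info` from tfds.load(..., with_info=True):
# text_feature = get_text_feature(info)
# encoder = text_feature.encoder  # the SubwordTextEncoder for this config
```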
The Text class (https://www.tensorflow.org/datasets/api_docs/python/tfds/features/Text) exposes several members that are of interest:
I tried them all, and it turns out decode_example won't work, but ints2str works as expected:
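A sketch of the decoding step, assuming an older TFDS version in which the Text feature still carries an encoder; there the method is spelled ints2str. The helper name decode_review is my own.

```python
def decode_review(info, token_ids):
    """Turn a sequence of subword ids back into text (hypothetical helper)."""
    text_feature = info.features['text']
    # Convert Tensor or NumPy elements to plain Python ints first.
    ids = [int(i) for i in token_ids]
    # ints2str delegates to the feature's encoder; calling
    # text_feature.encoder.decode(ids) should be equivalent.
    return text_feature.ints2str(ids)


# Usage, given `info` and `train_data` from tfds.load:
# for tokens, label in train_data.take(1):
#     print(decode_review(info, tokens))
```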