Word Tokenizing

Jul 18, 2024

machine-learning

Words in their raw string form can’t be used as inputs to machine learning models. Tokenizing is the process of assigning a unique numeric id to each word in your data set / vocabulary. You can build such a vocabulary by iterating over your data set and assigning an arbitrary id to each unique word. You may reserve some of these ids as special tokens (see the sketch after this list), e.g.

  • Marking the beginning or the end of your input
  • Distinguishing questions from answers

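As a minimal sketch, a word-level vocabulary might be built like this. The special token names (`[PAD]`, `[BOS]`, `[EOS]`) and the id layout are illustrative conventions, not a fixed standard:

```python
def build_vocab(corpus):
    # Reserve the lowest ids for special tokens; by convention,
    # 0 is often used as the padding id.
    vocab = {"[PAD]": 0, "[BOS]": 1, "[EOS]": 2}
    for sentence in corpus:
        for word in sentence.split():
            # Each unseen word gets the next free id.
            vocab.setdefault(word, len(vocab))
    return vocab

corpus = ["the cat sat", "the dog ran"]
vocab = build_vocab(corpus)
# {'[PAD]': 0, '[BOS]': 1, '[EOS]': 2,
#  'the': 3, 'cat': 4, 'sat': 5, 'dog': 6, 'ran': 7}
```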
Given that your inputs will vary in length, you have to pad shorter ones with zeroes (the reserved padding id) so that each one matches the model’s expected input size.
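Continuing the sketch above (the `encode` name and `max_len` parameter are illustrative, not from any particular library), encoding a sentence then becomes a dictionary lookup followed by padding:

```python
def encode(sentence, vocab, max_len):
    # Map each word to its id and wrap the sequence with
    # beginning-/end-of-input markers.
    ids = [vocab["[BOS]"]] + [vocab[w] for w in sentence.split()] + [vocab["[EOS]"]]
    # Right-pad with the padding id (0) up to the fixed input size.
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))

encode("the cat sat", vocab, max_len=8)
# [1, 3, 4, 5, 2, 0, 0, 0]
```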
