Word Tokenizing

Jul 18, 2024

machine-learning

Words in their raw string form can’t be used as inputs to machine learning models. Tokenizing is the process of assigning a unique numeric id to each word in your data set / vocabulary. You can build such a vocabulary by iterating over your data set and assigning an arbitrary id to each unique word. You may reserve some of these ids as special tokens (see the sketch after this list), e.g.

  • Marking the beginning or the end of your input
  • Distinguishing questions from answers

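As a minimal sketch, a word-level vocabulary might be built like this. The special token names (`[PAD]`, `[BOS]`, `[EOS]`) and the id layout are illustrative conventions, not a fixed standard:

```python
def build_vocab(corpus):
    # Reserve the lowest ids for special tokens; by convention,
    # 0 is often used as the padding id.
    vocab = {"[PAD]": 0, "[BOS]": 1, "[EOS]": 2}
    for sentence in corpus:
        for word in sentence.split():
            # Each unseen word gets the next free id.
            vocab.setdefault(word, len(vocab))
    return vocab

corpus = ["the cat sat", "the dog ran"]
vocab = build_vocab(corpus)
# {'[PAD]': 0, '[BOS]': 1, '[EOS]': 2,
#  'the': 3, 'cat': 4, 'sat': 5, 'dog': 6, 'ran': 7}
```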
Given that your inputs will vary in length, you have to pad shorter ones with zeroes (the reserved padding id) so that each one matches the model’s expected input size.
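Continuing the sketch above (the `encode` name and `max_len` parameter are illustrative, not from any particular library), encoding a sentence then becomes a dictionary lookup followed by padding:

```python
def encode(sentence, vocab, max_len):
    # Map each word to its id and wrap the sequence with
    # beginning-/end-of-input markers.
    ids = [vocab["[BOS]"]] + [vocab[w] for w in sentence.split()] + [vocab["[EOS]"]]
    # Right-pad with the padding id (0) up to the fixed input size.
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))

encode("the cat sat", vocab, max_len=8)
# [1, 3, 4, 5, 2, 0, 0, 0]
```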
