Spam & Toxic Text Detection
What is toxic text? 🤬
Anything disrespectful, abusive, unpleasant, harmful, and/or simply irrelevant (SPAM).
Detecting toxic text directly in the browser allows you to
filter it out at the origin, even before it reaches your
servers.
How does it work?
Once a sentence is submitted, the text is tokenized and passed to
the model. The model then returns a toxicity score ranging from 0 to 100.
If the score is greater than a given threshold, the text is considered toxic.
In the validation above, the threshold is 75. Anything equal to or above
that value is considered inappropriate and will be visually marked
as toxic. In your code, you can change this value to best fit your needs.
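To make this concrete, here is a minimal sketch of what that flow could look like with TensorFlow.js. The model URL, the wordToIndex lookup, the fixed input length, and the output scaling are all assumptions for illustration; the actual preprocessing depends on how the model was exported.

```ts
import * as tf from '@tensorflow/tfjs';

// Hypothetical word → index lookup: the real model ships with its own
// vocabulary, so this stands in for that mapping.
declare function wordToIndex(word: string): number;

const MODEL_URL = '/model/model.json'; // assumed path to the converted model
const MAX_WORDS = 20;                  // the model validates up to 20 words
const THRESHOLD = 75;                  // scores at or above this are flagged

let modelPromise: Promise<tf.LayersModel> | null = null;

// Load the TensorFlow.js model once and reuse it across calls.
function loadModel(): Promise<tf.LayersModel> {
  if (!modelPromise) modelPromise = tf.loadLayersModel(MODEL_URL);
  return modelPromise;
}

export async function toxicityScore(sentence: string): Promise<number> {
  const model = await loadModel();

  // Tokenize: lowercase, split on whitespace, map each word to an index,
  // then pad/truncate to the fixed input length the model expects.
  const indices = sentence.toLowerCase().split(/\s+/).map(wordToIndex);
  const padded = indices.slice(0, MAX_WORDS);
  while (padded.length < MAX_WORDS) padded.push(0);

  // Run inference; the output is assumed to be a probability in [0, 1],
  // scaled here to the 0–100 range described above.
  const input = tf.tensor2d([padded], [1, MAX_WORDS]);
  const output = model.predict(input) as tf.Tensor;
  const score = (await output.data())[0] * 100;
  input.dispose();
  output.dispose();
  return score;
}

export async function isToxic(sentence: string): Promise<boolean> {
  return (await toxicityScore(sentence)) >= THRESHOLD;
}
```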
About the model
The model was built with Model Maker's average word embedding algorithm
and trained on a custom dataset of almost 2,000 classified
comments from YouTube and other sources.
I trained the original Python model using
Google Colab, then
converted it to the TensorFlow.js format to be used in the browser.
The entire model is just
199KB, which makes
it very lightweight.
Future improvements 🚀
Currently the model only supports
English. It
would be great to add support for other languages in the future.
The only challenge is to find good datasets to train the model
on.
Also, at present, for maximum accuracy the model can validate no
more than
20 words at a
time. This is not much of a limitation though, as you can simply
split a long text into smaller chunks and validate each one.
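For longer inputs, that chunking can be as simple as sliding a fixed-size window over the words. Here is a small sketch reusing the hypothetical isToxic helper from the earlier snippet:

```ts
// Split a long text into chunks of at most 20 words and flag the whole text
// as toxic if any single chunk crosses the threshold.
async function containsToxicChunk(text: string, chunkSize = 20): Promise<boolean> {
  const words = text.trim().split(/\s+/);
  for (let i = 0; i < words.length; i += chunkSize) {
    const chunk = words.slice(i, i + chunkSize).join(' ');
    if (await isToxic(chunk)) return true;
  }
  return false;
}
```

Depending on your use case, you might instead keep the highest score across chunks rather than stopping at the first match.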
Although great care has been put into compiling a comprehensive
training dataset, inevitably there might be some false positives
and negatives. If you find any, please do let me know, or even
better, submit a pull request on
GitHub.
Open source
The entire code and all related files are available on GitHub, so feel free to have a look. There you'll also find more details on how to use the model in your own projects.
Machine Learning is so amazing! ✨