Paraphrasing Text without Machine Learning in JavaScript
=====================================================
Edited: May 6th 2025
Written by Mason Wright
=====================================================

Being able to summarize an article is extremely useful, and surprisingly it doesn't require machine learning to accomplish. Here we will go over the concepts behind frequency analysis and use it to build a text summarization API.

The type of summarization we will be doing here is called "extractive summarization", which relies on selecting the most important sentences in the text and combining them in a way that makes sense.

The advantages of using algorithmic summarization over a machine learning approach are speed and cost. Algorithmic summarization trades quality for speed and accessibility. However, in most cases, the quality of the extracted summary is high enough that it doesn't matter.

How it works

The first step in extractive text summarization is to break apart the sentences of the input text. This can be done using a regex approach like the one below, which collapses whitespace, strips citation markers like [1], and splits on sentence-ending punctuation.

+------------------------------------------------------------+
function breakSentence(sentences) {
  return sentences
    .replaceAll(/\s+/g, " ")
    .replace(/\[[0-9]+\]/g, "")
    .replace(/(?<=[.?!])\s+/g, "\n")
    .split("\n");
}

// Break the input into paragraphs, then into sentences
var paragraphs = text.split("\n\n").map((e, i) => {
  return {
    sentences: breakSentence(e),
    index: i,
  };
});

var doc = [];

// Index sentences in the document, dropping stopwords from each one
paragraphs.forEach((paragraph, pi) => {
  paragraph.sentences.forEach((sentence, index) => {
    var words = sentence
      .split(" ")
      .filter((n) => stoplist.indexOf(n) == -1);
    doc.push({
      sentence,
      words,
      index,
      paragraph: pi,
    });
  });
});
+------------------------------------------------------------+

Currently, when breaking the text apart, we first break it into paragraphs. This is not needed at this time (as long as you remove the paragraph spacing) but will be used once I figure out how I want to improve this algorithm. My idea is to identify the important sentences, then add context from the paragraphs they are found in. I have not completed this yet, but I plan to, and I will remove this step once it is done. If you would like to help, feel free to improve it on the gist!

Stopwords

Stopwords are words that do not add any meaningful context to a sentence. Because of this, we need to remove them from each sentence so they do not skew the results. You can get lists of stop words anywhere; WordNet is typically the source for most of them. However, to make things faster, we will instead use a condensed word list that covers most bases (it is contained in the GitHub gist linked at the bottom).
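To make the filtering step concrete, here is a minimal sketch of stopword removal. The stoplist below is only a tiny illustrative sample and removeStopwords is a helper name of my own; the condensed list actually used is in the gist.

+------------------------------------------------------------+
// A tiny illustrative stoplist; the condensed list in the gist is much larger
var stoplist = ["a", "an", "the", "and", "or", "of", "to", "in", "is", "it"];

// Drop any word that appears in the stoplist
function removeStopwords(sentence) {
  return sentence
    .toLowerCase()
    .split(" ")
    .filter((word) => stoplist.indexOf(word) == -1);
}

console.log(removeStopwords("The cat sat in the sun"));
// -> [ 'cat', 'sat', 'sun' ]
+------------------------------------------------------------+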
Assigning Sentence Frequencies

The rest of the process is calculating how frequently each non-stopword in a sentence shows up across the entire document. To do this we loop through each word in each sentence and compare it to each word in every sentence. This has some overhead but is typically a pretty lightweight operation.

+------------------------------------------------------------+
// Assign word frequencies: count how often each of a sentence's words
// appears anywhere in the document, normalized by sentence length
doc.forEach((item) => {
  var count = 0;
  item.words.forEach((word) => {
    var match = word;
    doc.forEach((item2) => {
      item2.words.forEach((word2) => {
        if (word2 === match) count++;
      });
    });
  });
  count = count / item.words.length;
  item.frequency = count;
});

doc.sort((a, b) => {
  return b.frequency - a.frequency;
});
+------------------------------------------------------------+

Once we have the word frequencies, we sort the sentences from highest to lowest score. From here you can slice however many sentences you want from the front, or you can set up the function to choose the best number of sentences for you. If you want a set number of sentences, you will need to extract the sentence property from the objects in the array and slice off the amount you want. To extract the "best" number of sentences, you can use the method I talked about in my last article https://decode.sh/redirecting-users-to-the-correct-page-after-a-404-error.txt, just slightly modified.

+------------------------------------------------------------+
let slicePoint = 0;

let scores = doc.map((e) => e.frequency);

let avg = average(scores);

// Walk down the sorted scores until the average of what remains
// drops far enough below the original average
for (let i = 0; i < scores.length; i++) {
  if (avg - average(scores.slice(i)) > 0.4) {
    slicePoint = i;
    break;
  }
}

// Keep the sentences above the slice point
doc = doc.slice(0, slicePoint);

return doc;
+------------------------------------------------------------+

+------------------------------------------------------------+
function average(array) {
  return array.reduce((a, b) => a + b) / array.length;
}
+------------------------------------------------------------+

I go into more detail on how this works in the other article, but the basic principle it operates on is that we only want to select the sentences that have the most impact in the article. Selecting a fixed number of sentences can either miss context or include sentences that could be removed. So we calculate the average "score", in this case from the frequency property, then remove elements from the array one at a time, starting from the highest value, until the average of what remains is a set amount lower than the original. In this example I ended up using 0.4, but you can change it to keep more or fewer sentences if you prefer.

This is the gist containing the full code: https://gist.github.com/lakefox/f67e5c34ddbcafde2a4c5f030db55a70
If you would like to contribute to this, please leave any changes in the gist.
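To see how the pieces fit together, here is a condensed, self-contained sketch of the whole pipeline in one function. The summarize name, the tiny inline stoplist, and the omission of the paragraph grouping are simplifications of mine; the complete version is in the gist above.

+------------------------------------------------------------+
// Condensed sketch of the full pipeline; see the gist for the complete version
var stoplist = ["a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "we"];

function average(array) {
  return array.reduce((a, b) => a + b) / array.length;
}

function summarize(text) {
  // Split the text into sentences (paragraph handling omitted for brevity)
  let sentences = text
    .replaceAll(/\s+/g, " ")
    .replace(/\[[0-9]+\]/g, "")
    .split(/(?<=[.?!])\s+/)
    .filter((s) => s.length > 0);

  // Index each sentence along with its non-stopword words
  let doc = sentences.map((sentence, index) => {
    let words = sentence
      .toLowerCase()
      .split(" ")
      .filter((w) => stoplist.indexOf(w) == -1);
    return { sentence, words, index };
  });

  // Score each sentence by how often its words appear across the document
  doc.forEach((item) => {
    let count = 0;
    item.words.forEach((word) => {
      doc.forEach((item2) => {
        item2.words.forEach((word2) => {
          if (word2 === word) count++;
        });
      });
    });
    item.frequency = item.words.length ? count / item.words.length : 0;
  });

  doc.sort((a, b) => b.frequency - a.frequency);

  // Cut the list where the remaining average drops 0.4 below the original
  let scores = doc.map((e) => e.frequency);
  let avg = average(scores);
  let slicePoint = scores.length;
  for (let i = 0; i < scores.length; i++) {
    if (avg - average(scores.slice(i)) > 0.4) {
      slicePoint = i;
      break;
    }
  }

  // Put the selected sentences back into reading order
  return doc
    .slice(0, slicePoint)
    .sort((a, b) => a.index - b.index)
    .map((e) => e.sentence)
    .join(" ");
}
+------------------------------------------------------------+

Running summarize on a block of text returns the highest-scoring sentences joined back together in their original order, which is the summary.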