Summarizing Text without Machine Learning in JavaScript
Being able to summarize an article is super useful, and surprisingly, it doesn't require machine learning to accomplish. Here we will go over the concepts behind frequency analysis and use them to build a text summarization API.
The type of summarization we will be doing here is called "extractive summarization". Extractive summarization relies on selecting the most important sentences in the text and combining them in a way that makes sense. The advantage of algorithmic summarization over a machine learning approach is speed and cost: the algorithmic approach trades some quality for speed and accessibility. However, in most cases, the quality of the extracted summary is high enough that the difference doesn't matter.
How it works
The first step in extractive text summarization is to break the input text apart into sentences. This can be done with a regex-based approach like the one below.
function breakSentence(text) {
  return (
    text
      // Collapse runs of whitespace into single spaces
      .replaceAll(/\s+/g, " ")
      // Strip citation markers like [1]
      .replace(/\[[0-9]+\]/g, "")
      // Mark a break after ., ?, or ! that does not follow a title
      // and is followed by a space and a letter
      .replace(/(?<!Mr|Mrs|Ms|Dr|Sr)([.?!]) ([A-Z])/gi, "$1{break}$2")
      .split("{break}")
  );
}
I have found that this function covers most edge cases when splitting text into sentences. The "easy" approach of splitting on periods falls short when you encounter a title like Mr. or when a sentence ends in another form of punctuation. This function identifies the end of a sentence by looking for a period, exclamation point, or question mark that does not follow a title and is followed by a space and a letter. Some cases this approach covers without explicitly defining them are abbreviations like Ph.D, since the internal period is not followed by a space and a letter. Once the breakpoints are identified, they are marked with the text {break}, which was chosen because it would be very uncommon to see in normal text. After the markers are inserted, the text is split on them.
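For example, running the function on a short sample of my own that mixes titles, an abbreviation, a citation marker, and varied punctuation splits only at real sentence boundaries:

let sample =
  "Dr. Smith has a Ph.D in physics[1]. She teaches at MIT! " +
  "Do students enjoy her lectures? Mr. Jones says they do.";

console.log(breakSentence(sample));
// [
//   "Dr. Smith has a Ph.D in physics.",
//   "She teaches at MIT!",
//   "Do students enjoy her lectures?",
//   "Mr. Jones says they do."
// ]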
For the summarization tool we are building, we will use the code below to split the text into sentences and extract the non-stopwords from each one.
let doc = [];
let stoplist = [...]; // This will be linked at the end
let paragraphs = text.split("\n").map((e, i) => {
  return {
    sentences: breakSentence(e),
    index: i,
  };
});

// Index sentences in the document
paragraphs.forEach((paragraph, pi) => {
  paragraph.sentences.forEach((sentence, index) => {
    // Keep only the words that are not in the stoplist
    let words = sentence
      .split(" ")
      .filter((n) => stoplist.indexOf(n) === -1);
    doc.push({
      sentence,
      words,
      index,
      paragraph: pi,
    });
  });
});
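To make the resulting structure concrete, suppose text is "The quick brown fox jumps. It jumps over the lazy dog." and the stoplist contains "over" and "the". Each entry in doc then looks like the sketch below. Note that the stoplist match is case-sensitive, so the capitalized "It" survives, and punctuation stays attached to words; this is usually fine for frequency counting, but worth knowing about.

console.log(doc[1]);
// {
//   sentence: "It jumps over the lazy dog.",
//   words: ["It", "jumps", "lazy", "dog."],
//   index: 1,      // position of the sentence within its paragraph
//   paragraph: 0,  // index of the paragraph it came from
// }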
Currently, when breaking the text apart, we first break it into paragraphs. This is not strictly needed yet (as long as you remove the paragraph spacing), but it will be useful once I figure out how I want to improve this algorithm. My idea is to identify the important sentences, then add context from the paragraphs they are found in. I have not completed this yet, but I plan to, and I will remove this note once it is done. If you would like to help, feel free to improve it on the gist!
Stopwords
Stopwords are words that do not add meaningful context to a sentence, so we need to remove them from each sentence to keep them from skewing the results. You can find stopword lists in many places; WordNet is the typical source for most of them. There is even an NPM package for automatically removing stopwords, fittingly named stopword. However, we will not be using either. Instead, to keep things fast, we will use a condensed word list that covers most bases (it is included in the GitHub gist linked at the bottom).
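As a quick standalone illustration with a tiny hand-picked stoplist (the condensed list in the gist covers far more words), the filtering itself is a one-liner:

// A tiny sample stoplist; the real condensed list is much longer.
let sampleStoplist = ["a", "an", "the", "is", "of", "to", "and", "in"];

let kept = "the cat sat in the warm sun"
  .split(" ")
  .filter((w) => sampleStoplist.indexOf(w) === -1);

console.log(kept); // ["cat", "sat", "warm", "sun"]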
Assigning Sentence Frequencies
The rest of the process is calculating how frequently each non-stopword in a sentence appears across the entire document. To do this, we loop through each word in each sentence and compare it to every word in every sentence. This nested loop is quadratic in the number of words, but for typical article lengths it is still a pretty lightweight operation.
// Assign word frequencies
doc.forEach((item) => {
  let count = 0;
  item.words.forEach((word) => {
    // Count how often this word appears anywhere in the document
    doc.forEach((item2) => {
      item2.words.forEach((word2) => {
        if (word2 === word) count++;
      });
    });
  });
  // Average the counts so longer sentences aren't favored
  // (guard against sentences that were entirely stopwords)
  item.frequency = item.words.length ? count / item.words.length : 0;
});

// Highest-scoring sentences first
doc.sort((a, b) => {
  return b.frequency - a.frequency;
});
Once we have the word frequencies, we need to sort the sentences from highest to lowest score. From here you can slice a fixed number of sentences off the front, or you can have the function pick the best number of sentences for you. For a fixed count, extract the sentence property from the objects in the array and slice however many you want, as in the sketch below.
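A minimal fixed-count version might look like this. I also re-sort the selected sentences by position, an extra step not in the code above, so the summary reads in document order:

// Keep the three highest-scoring sentences (doc is already sorted
// by frequency), then restore document order so the summary reads naturally.
let topN = 3;
let summary = doc
  .slice(0, topN)
  .sort((a, b) => a.paragraph - b.paragraph || a.index - b.index)
  .map((e) => e.sentence)
  .join(" ");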
To extract the "best" number of sentences, you can use the method I talked about in my last article, slightly modified.
let scores = doc.map((e) => e.frequency);

let avg = average(scores);

// Fall back to keeping every sentence if the threshold is never crossed
let slicePoint = scores.length;

for (let i = 0; i < scores.length; i++) {
  // Drop the highest scores one at a time until the average of what
  // remains falls more than 0.4 below the original average
  if (avg - average(scores.slice(i)) > 0.4) {
    slicePoint = i;
    break;
  }
}

// Keep the highest-scoring sentences up to the cutoff
doc = doc.slice(0, slicePoint);

return doc;
function average(array) {
  return array.reduce((a, b) => a + b) / array.length;
}
I go into more detail on how this works in the other article, but the basic principle is that we only want to select the sentences that have the most impact in the article. Selecting a fixed number of sentences can either miss context or include sentences that could be cut. So we calculate the average "score", in this case from the frequency property, then remove elements from the array one at a time, starting from the highest value, until the average of what remains is some threshold lower than the original average. In this example I ended up using 0.4, but you can raise or lower it to keep more or fewer sentences.
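As a worked example with made-up frequency scores:

let scores = [2.4, 1.6, 0.8, 0.6, 0.4, 0.2];
let avg = average(scores); // 1.0

// i = 0: difference is 0            -> keep going
// i = 1: average of the rest is 0.72, difference 0.28 -> keep going
// i = 2: average of the rest is 0.50, difference 0.50 > 0.4 -> stop
// slicePoint = 2, so the two highest-scoring sentences are kept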
Here is the gist containing the full code. If you would like to contribute, please leave any changes in the gist.