Removing Broken lines from Text

Removing Broken lines from Text is a WordSmith class on how to remove broken lines, that is, linefeeds that land in the middle of a sentence instead of after sentence-ending punctuation (. ? !). This situation usually comes up when you create a theWord module from text that you copied and pasted from the Internet (a page in html format). While the problem is extremely common, the solution (apart from going through the text by hand and fixing everything) is not easy to find (and I have looked).

First some general points on how to fix this

If your text is around 20 lines long, and you do this once every 6 months, fix it by hand and be happy. You are fortunate. But if you make several eSword or theWord modules each week, and some of the works you produce are public domain texts that run to thousands of words or sentences in a typical chapter, you need some help.

First of all, for this situation, you simply cannot rely on the manual method. Your time is too expensive to waste on fixing these texts by hand. It is good to do it by hand for a while, to get a feel for what is going on and what is needed. I consider this part of being a WordSmith. A WordSmith is somebody whose job is working with text.



The next thing you need is a good word processor. There are word processors like Microsoft Word, but they are “good” for other things rather than search and replace. Here I am talking about running maybe 200-300 sets of search words or phrases. For example, replace all the references like Psa cxix:4. with Psa 119:4. If you search and replace Roman numerals just in Psalms, you are going to have to go up to 150 chapters.
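To give an idea of what that Roman numeral job looks like when it is automated, here is a minimal sketch in Python (not in LibreOffice). It is only an illustration under assumptions: the names `roman_to_int` and `arabic_references` are made up for this example, and the pattern assumes references written exactly like "Psa cxix:4" (a capitalized book abbreviation, a space, a lowercase Roman chapter, a colon, and a verse number). Real texts will have other variations.

```python
import re

# Values of the Roman digits needed for chapter numbers up to 150.
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100}

def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'cxix' to an integer (119)."""
    total = 0
    previous = 0
    # Read right to left: a smaller digit before a larger one is subtracted.
    for ch in reversed(numeral.lower()):
        value = ROMAN_VALUES[ch]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total

# Assumed reference shape: "Psa cxix:4" (book, Roman chapter, colon, verse).
REFERENCE = re.compile(r"\b([A-Z][a-z]+) ([ivxlc]+):(\d+)")

def arabic_references(text: str) -> str:
    """Rewrite every 'Book roman:verse' reference with an Arabic chapter."""
    def convert(match: re.Match) -> str:
        book, chapter, verse = match.groups()
        return f"{book} {roman_to_int(chapter)}:{verse}"
    return REFERENCE.sub(convert, text)

print(arabic_references("Compare Psa cxix:4. with Gen iv:7."))
```

One pattern with a small conversion function covers all 150 chapters, where a plain search and replace needs one entry per chapter number.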

Although Microsoft Word was what I cut my teeth on, and its macro language back in 1986-1990 was excellent, there is a problem that continues in Microsoft products: they get far too complicated, far too fast. The last time I looked, you really needed to know programming languages (one is not enough) in order to use it well. We are doing text manipulation here, not rocket science.

So I pass over Microsoft Word (although I do use it for regular document creation, like when I write something myself for my own church). I found LibreOffice (a fork of OpenOffice) much better for these purposes.

The best way I have found to do all of this is to make macros, have them ready to use (icons on the LibreOffice top menu bar) and tweak them constantly.

Sample text

But the Christian has far higher thoughts of the work of Christ. Not even

the doctrine of substitution will satisfy him, for this is but one aspect

of a far greater truth. It is his joy to know that he is one with Him who

died and rose again. And his acceptance is not in Adam restored in virtue

of that death, but in Christ as Head of the New Creation. Old things are

passed away; all things are become new. And, “therefore, if any man be in

Christ he is a new creature.” And the wave-sheaf was but the firstfruits of

the harvest. Our oneness with Him begins with his death, but it does not

end with his resurrection. The believer shall be like Him who died and rose

again. “We know that we shall be like Him, for we shall see Him as He is.”

But here let us pause, lest the recoil from the gospel of the crucifix

should throw us off the line of truth. “As we have borne the image of the

earthy, we shall also bear the image of the heavenly.” But let us not

forget that the Resurrection lies between the two. “Sudden death, sudden

glory,” is an epigram which represents a system of doctrine that is always

popular. But is it Scriptural? The

reserve which Holy Writ maintains respecting the intermediate state forbids

all dogmatism on a subject so full of mystery. But our Lord’s words to the

dying thief seem to me clearly to warrant the belief that that mysterious

sleep is consistent with conscious enjoyment of his presence. And if the

inspired apostle did not cherish that belief, his language to the

Corinthians and Philippians seems scarcely intelligible. But further than

this we may not go.

When we speak of glory, or of the activities of service, before the day of

full redemption – the day of the coming of the Lord – we allow our thoughts

to be swayed by sentiment in a sphere where Scripture should control them

absolutely. The Christian can triumph over death. But it is a triumph

of faith achieved in presence of stern facts to the reality of which both

our reason and our senses bear signal testimony. In itself death is utterly

horrible and hateful; and those who cannot see beyond it may well shrink

back in terror, and seek to conceal its loathsomeness beneath bright

trappings and wreaths of flowers. But the Christian, in the power of faith

in his risen and glorified Lord, can dare to face the facts, and, with full

realization of their repulsiveness and horror, calmly to utter the

redemption challenge, “0 Death, where is thy sting? 0 Grave, where is thy

victory?”

Does this text look okay to you? It should. It is in html format and this is the Internet. But it isn’t good for a module. See image below.

[Image: linebreak sample text]

First of all, notice the breaks at the wrong places at the end of the physical lines. In html, each line has these breaks, and that is “good” HTML by the way. Otherwise the line would continue off the screen and not wrap to the next line as we would expect.

So we need to first of all see what is going on, and then second of all, we need to manipulate the text such that we can fix things, and then third of all, we need to make a production method to automate this.

Sample from LibreOffice

In the image at left/above, you will notice a paragraph mark at the end of each line. Coming from HTML, it can be either a linefeed or a paragraph mark. Both cause the same problem, and your fixer-upper macro for this must take care of both.

Here I am just going to say that LibreOffice is not the be-all and end-all of text processors. Okay. We got that out of the way. I chose LibreOffice because #1 it does the job cleanly (the sloppy parts are because the person writing the macros is not so smart, and in these examples, that is me, David Cox), and #2 LibreOffice is free, and they keep updating and supporting it.

LibreOffice to the Rescue

Before I get into the nitty gritty here, download LibreOffice and install it on your computer. Once you have done that, play around with it for an afternoon to get familiar with it. Note that it is a Microsoft Word knock-off, so what you can do in Microsoft Word, you can probably do in LibreOffice. It might not have all the bells and whistles of Microsoft Word, but it has probably 90% of them. While playing around with it, search for the LibreOffice extension Alternate Search and Replace. (In LibreOffice, open Tools -> Extension Manager, and look for it there.) This is an additional plugin that is also offered for free. It will add green binoculars to the menu bar, at the far left.

The search language in LibreOffice is good and more or less simple (i.e. it is simpler than learning a programming language in order to script Microsoft Word). But it still has a bit of a steep learning curve. Inside the Alt Search and Replace plugin, there is a button called [Batch]. Click on that, and you will find a number of sample search and replaces that come preloaded with the plugin. One key thing is that some begin with “Text [All]” and others with “Text [Sel]”. Within the alternative search and replace, you can run the macro on the entire text, or select just part of the text and run the macro on that part. For example, if a work has a poem, select the poem and write the macro to search for paragraph marks and replace them with linefeeds. I do not know of any other word processor that will do this (I doubt even Microsoft Word will). So the power of using Alt Search and Replace (only in LibreOffice and OpenOffice) is tremendous.

The Actual Macro

[Name] Text [all] Linebreaks or newlines to space

[Find]\.$
[Replace].[newparagraph]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\.”$
[Replace].”[newparagraph]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\?$
[Replace]?[newparagraph]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\?”$
[Replace]?”[newparagraph]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\!$
[Replace]![newparagraph]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\!”$
[Replace]!”[newparagraph]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]^$
[Replace][newparagraph]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find] \p
[Replace]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\p
[Replace]
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\[newparagraph\]\[newparagraph\]
[Replace]\p
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\[newparagraph\] \[newparagraph\]
[Replace]\p
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]\[newparagraph\]
[Replace]\p
[Parameters] MsgOff Regular
[Command] ReplaceAll

[Find]
[Replace]
[Parameters] MsgOff Regular
[Command] ReplaceAll

Record a short macro that does something simple in Microsoft Word or LibreOffice. Then open the macro editor and look at it. If you understand programming, great. Otherwise, you will probably need a few weeks of studying this macro to figure it out. Recorded macros are complicated. They are actually written in a programming language. The above is not. It is simple, and it can be reproduced without much of a headache.

The macro has a line naming it, and the searches (Find) and replaces are straightforward, each with a parameters line and a command (do it) line. Let me just give you a quick tour of the commands above. There are 13 individual search and replace commands.

MsgOff – Turns off the popup window that reports x number of replaces after each search completes. The above code will show 13 popups if you don’t disable them. That is 13 click-to-continue points. Try it without MsgOff if you like, but my advice is to use it in these quick and dirty macros for the most part.

Regular – Says whether the search is a regular expression or not.

Command: ReplaceAll – basically a do it point.

Regular Expressions

Let me explain here that we are looking for paragraph marks, and they are not normally addressed in search and replace macros in any word processor. Most text editors offer only very limited support for selecting and manipulating something by means of a regular expression. For what we want, we need that support extensively.

Wow! So this plugin used in LibreOffice is extremely detailed and powerful as far as manipulating text on a micro level. This is what we want. I am still a novice with Regular Expressions myself, but I have read through help guides on them.

On the one hand, they are extremely helpful, and you can do things using them that would otherwise be totally impossible to do.

On the other hand, they are extremely dangerous. If you don’t watch out for what you are doing, things can get messed up quickly. NEVER MAKE A MACRO AND RUN IT ON YOUR TEXT WITHOUT TESTING IT TO MAKE SURE IT IS WORKING CORRECTLY!

A corollary to that rule is to always back up everything. First of all, back up your macros in case you tweak one and ruin it. Secondly, back them up in case your hard disk goes bad one day. I have made these macros several times. At first it took me an entire day, and sometimes several days spread out over months, to get a macro to do what I want. When my hard disk went bad, I didn’t even think about the macros. But I lost them. Then one day I needed a macro for something like converting Roman Bible references to Arabic numbers, and guess what, no copy in any of my backups.

Also, when you have the original text in the theWord program, copy it to LibreOffice. Make a backup of your modules before you start doing too much modification. I have lost both the text in theWord and in LibreOffice because a macro caused everything to freeze (or I pressed the wrong key), and I had to reboot the computer. The theWord module was blank in the topic I was using, and the LibreOffice document was also. This has happened a tremendous number of times with Windows 10, because I am working on something, somebody calls me or comes to visit, or my wife wants me to run to the store, and when I get back to my computer, it has decided that since I wasn’t using it anymore, it would totally shut down. On reopening things, theWord and LibreOffice didn’t have the original versions of the text, but one that had problems.

Even if I make it go to sleep mode, it doesn’t come back right every time. Sometimes, sleep mode just is a lie and it has to totally reboot. Thank Microsoft for the automatic updates that make this happen!
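If you want the backup habit to cost you nothing, a small script can make a dated copy of a file before you touch it. This is only a sketch in Python under assumptions: the function name `backup_file` is made up for this example, and which files you hand it (your saved macro text, a LibreOffice document, a module file) and where they live on your computer is up to you.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_file(source: str, backup_dir: str) -> Path:
    """Copy `source` into `backup_dir` under a time-stamped name."""
    src = Path(source)
    dest_dir = Path(backup_dir)
    # Create the backup folder if it does not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Example result: macros-20240131-154500.txt
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    # copy2 copies the contents and tries to keep the file's timestamps.
    shutil.copy2(src, dest)
    return dest
```

Pointing `backup_dir` at a different disk or a synced folder is what actually protects you from a dead hard disk. A copy on the same disk only protects you from your own tweaks.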

The Guts of how this Macro Works

Okay, the above are the tools, but you need to get into your mind HOW THE MACRO WORKS.

First of all, we are going to do some search and replaces as preliminary tasks. How do you replace the linefeeds at the end of all of these lines? When you do it simply, you end up with a block of text with absolutely no paragraph marks, and this is bad. Unreadable.

So we want to separate out all the valid paragraph marks and keep them, and then delete all the unwanted ones.

Step #1 Replace valid Paragraphs with a place holder

So I make some presumptions. Depending on the text (which changes greatly from source to source), you might need to either copy parts of your text, run this macro on them, and then paste them back into the module, or make another macro for different situations/texts. Remember that the macro is only as smart as the guy or gal writing it, designing it. So my macros might simply not work on your text.

Assumption #1: Any double paragraph mark probably indicates a valid paragraph break. But we don’t want double paragraph marks in the result; we want a properly formatted paragraph (space above and below).

Assumption #2: Any line ending in a text sentence finality marker should be a valid paragraph. What is a “text sentence finality marker”?

. (a period) | ? Question mark | ! Exclamation mark

Add to these three the variations,

.” | ?” | !”

A valid paragraph can also end with

.) | .] | .}

But for each variation you add, you also have to search the entire text again and again. For 1 page of text, no problem. For 100 pages of text, this macro is going to run for a while, and literally, you can go get a cup of coffee and drink it all before it is over. You can even go take a shower and it will still be running when you come back.
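For comparison, outside of LibreOffice one regular expression can test for all of these endings in a single pass. The sketch below is Python, not the Alt Search and Replace batch language, and the names `PARAGRAPH_END` and `ends_paragraph` are made up for this example. It assumes that a valid paragraph ends in . ? or ! optionally followed by one closing quote or bracket.

```python
import re

# One sentence-final mark, optionally followed by a single closing
# quote or bracket, at the very end of the line.
PARAGRAPH_END = re.compile(r"""[.?!][”"')\]}]?$""")

def ends_paragraph(line: str) -> bool:
    """Return True if the line ends the way a finished paragraph would."""
    # Ignore trailing spaces before testing the last characters.
    return bool(PARAGRAPH_END.search(line.rstrip()))

for line in ["But is it Scriptural?",
             "Christ he is a new creature.”",
             "the doctrine of substitution will satisfy him, for this is but one aspect"]:
    print(ends_paragraph(line), "|", line)
```

Adding another closing character here means adding one character inside the second pair of square brackets, not another full pass over the text.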

I am going to search the text for valid paragraphs and put a placeholder on each of them, using something that should never naturally occur in the text, “[newparagraph]”, and then come back at the end of the macro to substitute a regular paragraph mark for this placeholder.
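The same placeholder idea can be sketched in a few lines of Python, for readers who think better in code. This is an approximation of the approach, not a translation of the macro: it works on plain text where every break is a "\n", the function name `rejoin_broken_lines` is made up for this example, and it only knows the endings . ? ! with an optional closing ” after them.

```python
import re

PLACEHOLDER = "[newparagraph]"

def rejoin_broken_lines(text: str) -> str:
    """Join lines broken mid-sentence while keeping real paragraph breaks."""
    text = text.replace("\r\n", "\n")
    # 1. Mark every line that ends in sentence-final punctuation.
    text = re.sub(r"([.?!]”?)$", lambda m: m.group(1) + PLACEHOLDER,
                  text, flags=re.MULTILINE)
    # 2. Mark every empty line (the "double paragraph" case).
    text = re.sub(r"^$", PLACEHOLDER, text, flags=re.MULTILINE)
    # 3. Turn every remaining line break into a single space.
    text = re.sub(r" ?\n", " ", text)
    # 4. Collapse each run of placeholders into one real paragraph break.
    marker = re.escape(PLACEHOLDER)
    text = re.sub(rf"(?:{marker} ?)+", "\n", text)
    return text.strip()

sample = ("Old things are\n"
          "passed away; all things are become new. And the\n"
          "harvest is near.\n"
          "\n"
          "But is it Scriptural? The\n"
          "reserve forbids dogmatism.\n")
print(rejoin_broken_lines(sample))
```

The order matters in the same way it does in the macro: the valid breaks are marked first, then all breaks are flattened, and only then are the marks turned back into paragraph breaks.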

Note: I have been using [newparagraph] for a good while now, and it turns out that square brackets have a special meaning in regular expressions (they enclose a set of characters to match). Who knew? So I had to escape them as "\[" and "\]" in the Find lines in order to not mess things up. Literally, I joke not, before I did that I was getting a new paragraph at the end of every word. Garbage gobbledygook.
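The bracket trap is easy to reproduce in Python, whose regular expressions treat square brackets the same way. This is a small demonstration, not LibreOffice code, and the two function names are made up for this example. The first function searches for the placeholder without escaping it, so the pattern means "any one of the letters n, e, w, p, a, r, g, h". The second escapes it with `re.escape`, so the pattern means the literal text.

```python
import re

PLACEHOLDER = "[newparagraph]"

def restore_unescaped(text: str) -> str:
    # Wrong on purpose: the brackets make this a character class,
    # so single letters all over the text get replaced.
    return re.sub(PLACEHOLDER, "\n", text)

def restore_escaped(text: str) -> str:
    # re.escape puts a backslash before each bracket, so the pattern
    # matches the literal placeholder and nothing else.
    return re.sub(re.escape(PLACEHOLDER), "\n", text)

marked = "rose again.[newparagraph]But here"
print(repr(restore_escaped(marked)))
print(repr(restore_unescaped(marked)))
```

Running it shows the escaped version giving back two clean lines, while the unescaped version chops up ordinary words and leaves the brackets behind.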

Problem double spaces

At first I ran it and did about 50 topics with it. Then I noticed that the macro produces a double space everywhere a linefeed is replaced. So I added code to search for double spaces and replace them with single spaces. That will probably come back to haunt me in the future.

Then I decided, no, the solution to this problem of double spaces is to first search for a space followed by a linefeed and replace that pair with just a space, and then search for the remaining linefeeds (those with no space before them) and replace each of those with a space. This took care of the double space problem.
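Here is the double space problem and its fix side by side, as a minimal Python sketch on plain text where each break is a "\n". The function names `join_naive` and `join_clean` are made up for this example, and the sketch only handles a single trailing space before the break, which is the case described above.

```python
def join_naive(text: str) -> str:
    # Every line break becomes a space, even when the line already
    # ended with one, which leaves a double space behind.
    return text.replace("\n", " ")

def join_clean(text: str) -> str:
    # First pass: "space + break" becomes one space.
    text = text.replace(" \n", " ")
    # Second pass: any break that is left had no space before it.
    return text.replace("\n", " ")

sample = "Not even \nthe doctrine\nof substitution"
print(repr(join_naive(sample)))
print(repr(join_clean(sample)))
```

Doing it in this order avoids a blanket "double space to single space" replacement, which could also touch double spaces that the author put in the text on purpose.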

Again, I warn you, there are assumptions that I am making like the above, and depending on the text, they may be valid or not.

The Example of letters

blah blah blah.

Regards,
David Cox

The linefeed after Regards, above is valid and should be kept. But the macro would reformat it to

Regards,David Cox

So be careful what you run through the washer, because some things will come out with unexpected results (white gym socks washed with a red t-shirt, for example, will come out pink). The same precautions apply here.

So while it takes a while to run, it does do the job. Unfortunately, there are sometimes 3000 or more replacements that take place in the text of just a single topic or chapter. So if your work has 300 chapters, plan on a week to fix it. But it is worth it for a quality module.

Please add in the comments of this post any easier or faster or more accurate, or even alternate way of doing the same. I would be greatly pleased with seeing your solution if it is different than mine.
