using wget to get information from a web page, then using grep and sed to remove formatting
process: download files (script: base-get-book.sh)
#####################################
wget -O INFO-1 "http://www...page 1
wget -O INFO-2 "http://www...page 2
wget -O INFO-3 "http://www...page 3
#####################################
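With many pages, the repeated wget lines can be generated by a loop. The following is only a sketch: the function name fetch_pages, the FETCH variable, and the idea that the page number is simply appended to a URL prefix are all assumptions, not something the site is known to support.

```shell
#!/bin/bash
# Hypothetical helper: fetch pages 1..count into INFO-1, INFO-2, ...
# ASSUMPTION: the page number is appended to a URL prefix; adjust for the real site.
fetch_pages() {
    local prefix=$1
    local count=$2
    local i=1
    while [ "$i" -le "$count" ]
    do
        # FETCH defaults to wget; set FETCH=echo to preview the commands (dry run)
        ${FETCH:-wget} -O "INFO-$i" "$prefix$i" || return 1
        i=$((i + 1))
    done
}
```

Running it with FETCH=echo first shows which commands would be issued before anything is downloaded.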
process: strip out formatting to get specific content
(script: extract-text-book.sh)
#####################################
#!/bin/bash
#####################################
mkdir -p INFO
for x in INFO-*
do
#####################################
# keep lines tagged "text $x", strip markup and bracketed notes, start a new line at each number (\n needs GNU sed)
grep "text $x" "$x" | sed 's/New American/ New American/g' | \
sed 's/[^<]*<\/span>/ /g' | \
sed 's/<[^>]*>//g' | sed 's/ \[[^]]*\]//g' | sed 's/ \([1-9]\)/\n\1/g' > "$x.txt"
#####################################
mv "$x" INFO
done
#####################################
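To see what the last stages of the pipeline do, two of the sed expressions can be tried on a sample line. This is a sketch only: the function names strip_tags and split_on_numbers and the sample text are made up for illustration, and GNU sed is assumed for the \n in the replacement.

```shell
#!/bin/bash
# Two of the sed stages from extract-text-book.sh, wrapped as filters.

# Remove anything that looks like an HTML tag: <...>
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Start a new line before every " <digit 1-9>" (GNU sed: \n means newline)
split_on_numbers() {
    sed 's/ \([1-9]\)/\n\1/g'
}

# Example (made-up input):
#   printf '<p>1 first line <b>2</b> second line</p>\n' | strip_tags | split_on_numbers
```

Note that split_on_numbers also breaks the line at any number inside the text itself, which is one reason the files still need a manual pass later.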
process: make sure all files have proper names for sorting
ls > fix
#####################################
vi fix
delete the lines for names that already sort correctly (e.g. filename-10.txt), and any line that is not a file to rename (the listing includes "fix" itself)
keep the lines for names that need a zero added, then:
:%s/.*/mv & &/g
result: mv filename-1.txt filename-1.txt
then move the cursor to the number in the second name and insert a 0 before the 1, to get: mv filename-1.txt filename-01.txt
repeat on each line (usually 1-9) until done.
:wq
#####################################
sh ./fix ; rm -f fix
#####################################
ls -al (should show all files sorted nicely)
#####################################
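The same renaming can be done without editing a list by hand. This is a minimal sketch, assuming the files are named PREFIX-N.txt as produced above; the function name pad_names is made up for illustration.

```shell
#!/bin/bash
# Sketch: add a leading zero to single-digit names, e.g. INFO-1.txt -> INFO-01.txt
pad_names() {
    local prefix=$1
    local f n
    for f in "$prefix"-[1-9].txt      # matches single-digit names only
    do
        [ -e "$f" ] || continue       # the glob matched nothing
        n=${f#"$prefix"-}             # drop the "PREFIX-" part
        n=${n%.txt}                   # drop the ".txt" part
        mv -- "$f" "$prefix-0$n.txt"
    done
}
```

Called as pad_names INFO in the directory holding the .txt files; names that already have two digits are left alone.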
process: remove remaining internal brackets
use script to remove brackets (it edits every *.txt file in the current directory and ignores any arguments):
sh ./remove-brackets.sh
#####################################
script: remove-brackets.sh
#!/bin/bash
echo "remove bracketed footnotes"
perl -pi -e 's/\[.\]//g' *.txt
perl -pi -e 's/\[..\]//g' *.txt
#####################################
process: edit all the files to remove errors and put each numbered entry on one line
vi INFO-*
after making sure there is a number on the left and all the text on one line, then
:%s/.*/INFO 9:&/g
:wn
:%s/.*/INFO 10:&/g
:wn
:%s/.*/INFO 11:&/g
:wn
# where INFO is the book or file name, and 9 is the chapter.
# The ampersand stands for the whole matched line (number plus text), so the prefix is added to its left
repeat, with the next chapter number each time, until all files are edited
#####################################
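Once a file has been checked by hand, the prefix step itself can be scripted. This sketch swaps vi for awk; the function name prefix_lines and the temporary file name are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: put "BOOK CHAPTER:" in front of every line of a file
prefix_lines() {
    local book=$1
    local chapter=$2
    local file=$3
    # write to a temporary file first, then replace the original
    awk -v p="$book $chapter:" '{ print p $0 }' "$file" > "$file.tmp" &&
        mv -- "$file.tmp" "$file"
}
```

For example, prefix_lines INFO 9 INFO-09.txt has the same effect as :%s/.*/INFO 9:&/g on that file.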
process: copy files to proper directories and other servers:
cp INFO-*.txt ../SORTED-INFO
scp -r ../SORTED-INFO/ 192.168.11.11:/home/luser/FILES/SORTED-INFO
or
rsync -r ../SORTED-INFO/ 192.168.11.11:/home/luser/FILES/SORTED-INFO
#####################################
repeat until all 1,189 chapters are cleaned up and sorted...
then check the totals: there should be 31,102 lines and, depending on the version, 781,621 words
cat INFO-* > total-info.txt ; wc total-info.txt
(note: as of 3/7/2017 I'm on 14/66)
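The final check can also be scripted so that a mismatch is reported instead of read off the wc output. A minimal sketch; the function name check_counts is made up for illustration.

```shell
#!/bin/bash
# Sketch: compare a file's line and word counts with the expected totals
check_counts() {
    local file=$1
    local want_lines=$2
    local want_words=$3
    local have_lines have_words
    # $(( ... )) strips any padding some wc versions print
    have_lines=$(( $(wc -l < "$file") ))
    have_words=$(( $(wc -w < "$file") ))
    if [ "$have_lines" -eq "$want_lines" ] && [ "$have_words" -eq "$want_words" ]
    then
        echo "OK: $have_lines lines, $have_words words"
    else
        echo "MISMATCH: $have_lines lines (want $want_lines), $have_words words (want $want_words)"
        return 1
    fi
}
```

With the totals given above, the call would be check_counts total-info.txt 31102 781621.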