#318069 - 13/01/2009 18:03
bash scripting (xml parsing) help... (mainly awk and sed)
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I wanted a solution to grab trailers from Apple so that I can feed them into my SageTV media center and browse them within its library. Since I haven't been able to find a ready-made solution, I thought I'd need to cook something up myself. But I managed to find a bash script written by someone who uses a different media center program. It works, but it did a few things I didn't like to the filenames of the trailers, so I've been modifying it.
I'm running the script from cygwin on a Windows box, so I don't have the full set of tools that might otherwise be installed on something like my web server (or I would likely have tried redoing the whole thing in PHP).
The script is below. I've commented out some functional lines to allow me to test the values of some variables (the part I'm having issues with right now). Here's also a link to the original: http://forum.team-mediaportal.com/plugins-47/mytrailers-42622/index11.html#post291349
Basically you set a path to store a DB which keeps track of what's already been downloaded and another path to store the files. You set a parameter which tells the script whether to try and get 1080 versions. It reads in the XML for Apple's trailer RSS feeds and parses out the filenames and other fields.
I am trying to modify the original to create a folder for each trailer and name that folder and the trailer according to the name of the movie, instead of using the original filename, which doesn't have any spaces and may contain extra characters like "tlra_640w" etc.
I don't know very much about how to use awk or sed, nor much about scripting with bash, which is why I'm posting.
The main issue at the moment is being able to parse the fields grabbed from the XML. They're created in TRAILERS and used to have a semicolon between them. I've changed this to ";field" to see if I can distinguish this field separator from any other legitimate use of a semicolon within the data.
But it's still failing to get the name "Angels &amp; Demons", which is the first movie name with a space and a semicolon in it. We'll move on to the subject of decoding the &amp; and similar later. This line specifically:
MOVIETITLE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $2 }'`
is only bringing back the word "Angels" from the above. It's probably failing because of the space, but I don't know how to make it bring back its results enclosed in quotes.
#!/bin/bash
GET1080p=0
GETPOSTER=1
SAVEPATH="v:/Movies/zzztrailertest/"
DLDBPATH="d:/AppleTrailers/"
FEEDS="http://www.apple.com/trailers/home/xml/current_720p.xml http://www.apple.com/trailers/home/xml/current.xml"
tail -5000 $DLDBPATH.downloaded.db > $DLDBPATH.downloaded.db.tmp
mv $DLDBPATH.downloaded.db.tmp $DLDBPATH.downloaded.db
for FEEDURL in $FEEDS; do
    TRAILERS=`xml sel --net -D -T -t -m "/records/movieinfo"\
        -v "@id" -o ";field"\
        -v "info/title" -o ";field"\
        -v "info/postdate" -o ";field"\
        -v "preview/large" -o ";field"\
        -v "poster/xlarge"\
        -n $FEEDURL`
    for MOVIE in $TRAILERS; do
        MOVIEID=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $1 }'`
        MOVIETITLE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $2 }'`
        MOVIETITLEFILE=`echo $MOVIETITLE |sed 's/.*\///'`
        #temporary output to show grabbed title
        echo "=======##### Title: $MOVIETITLE -----------------"
        POSTDATE=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $3 }'`
        BEXTENSION="[Trailer].mov"
        PREVIEW=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $4 }'`
        PREVIEWFILE=`echo $PREVIEW |sed 's/.*\///' |sed 's/\.mov$/.hdmov/g'`
        NEWPREVIEWNAME="$MOVIETITLE $BEXTENSION"
        PREVIEW1080p=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $4 }' |sed 's/a720p\.mov$/h1080p.mov/g'`
        PREVIEWFILE1080p=`echo $PREVIEW1080p |sed 's/.*\///' |sed 's/\.mov$/.hdmov/g'`
        NEWPREVIEWNAME1080p="$MOVIETITLE $PREVIEWFILE1080p"
        POSTER=`echo $MOVIE | awk 'BEGIN { FS = ";field" } ; { print $5 }'`
        NEWPOSTERNAME="folder.jpg"
        MOVIESAVEPATH="$SAVEPATH$MOVIETITLE/"
        #if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
        # mkdir $MOVIESAVEPATH
        #fi
        if [ "$GET1080p" -eq "1" ]; then
            if `echo $FEEDURL | grep -q 720p`; then
                if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
                    # wget -c -O "$MOVIESAVEPATH$NEWPREVIEWNAME1080p" $PREVIEW1080p; PREVIEWOUT1080p=$?
                    if [ $PREVIEWOUT1080p -eq 0 ]; then
                        echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME1080p" >> $DLDBPATH.downloaded.db
                    else
                        echo "##### ID:$MOVIEID URL:$PREVIEW1080p FAILED -- TRYING ORIGINAL 720p URL NEXT"
                    fi
                fi
            fi
        fi
        if ! grep -q "###$MOVIEID.PREVIEW" $DLDBPATH.downloaded.db; then
            # wget -c -O "$MOVIESAVEPATH$NEWPREVIEWNAME" $PREVIEW; PREVIEWOUT=$?
            if [ $PREVIEWOUT -eq 0 ]; then
                echo "###$MOVIEID.PREVIEW $NEWPREVIEWNAME" >> $DLDBPATH.downloaded.db
            else
                echo "##### ID:$MOVIEID URL:$PREVIEW FAILED -- RETRY NEXT RUN"
            fi
        else
            echo "##### ID:$MOVIEID NAME:$NEWPREVIEWNAME MARKED DONE -- SKIPPING"
        fi
        if [ "$GETPOSTER" -eq "1" ]; then
            if ! grep -q "###$MOVIEID.POSTER" $DLDBPATH.downloaded.db; then
                # wget -c -O "$MOVIESAVEPATH$NEWPOSTERNAME" $POSTER; POSTEROUT=$?
                if [ $POSTEROUT -eq 0 ]; then
                    echo "###$MOVIEID.POSTER $NEWPOSTERNAME" >> $DLDBPATH.downloaded.db
                else
                    echo "##### ID:$MOVIEID URL:$POSTER FAILED -- RETRY NEXT RUN"
                fi
            else
                echo "##### ID:$MOVIEID NAME:$NEWPOSTERNAME MARKED DONE -- SKIPPING"
            fi
        fi
    done
done
Sample XML file (this is what it's parsing when it hits Apple's feed): http://mypocket.com/current_720p.xml.zip
As mentioned, I'll also need to clean up the results by decoding things like &amp; back to "&", and I have no idea how to do that from this script. In PHP I can decode those using a function call to html_entity_decode().
One of the remaining things to look at is whether the usage of sed that's specified when grabbing the name is sufficient. Because I'm using the movie name to create a folder and a file, I can't have things like colons, slashes or other invalid characters in it. I can't guess what will be coming up in future movie names, so it's possible that invalid characters may be present either originally as plain text or from decoding any html entities (such as greater-than or less-than).
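For the entity decoding asked about above, a minimal sketch using sed might look like the following. It assumes only a handful of entities matter; decode_entities is a hypothetical helper name, not something from the original script, and it may turn out to be unnecessary if "xml sel -T" already emits decoded text.

```shell
#!/bin/bash
# decode_entities is a hypothetical helper, not part of the original script.
# It reads text on stdin and decodes a few common entities.
# &amp; is handled last so that "&amp;lt;" becomes the literal text "&lt;"
# instead of being decoded twice into "<".
decode_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&#39;/'/g" \
        -e 's/&amp;/\&/g'
}

printf '%s\n' 'Angels &amp; Demons' | decode_entities
```

In the sed replacement text a bare & means "the matched text", which is why the last expression writes \& to get a literal ampersand.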
#318070 - 13/01/2009 18:05
Re: bash scripting (xml parsing) help...
[Re: hybrid8]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31605
Loc: Seattle, WA
|
Brilliant idea. I'd love to have the current apple trailers page listed in my DVR's menu. Hm, my new DVR is networkable, maybe I can set that up somehow too.
#318071 - 13/01/2009 18:08
Re: bash scripting (xml parsing) help...
[Re: tfabris]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
While this script can be modified to make some type of textual listing with links to individual trailers, I'm actually trying to download them all. This will run daily (cron/schedule) to keep me up to date. I'll eventually come up with something to allow me to expire or remove trailers. Though SageTV does allow me to perform deletions right from its UI.
#318076 - 13/01/2009 18:24
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
That one line seems to work correctly for me. What does $MOVIE contain at that point?
_________________________
Bitt Faulk
#318078 - 13/01/2009 18:46
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
I'll have to dump out the value of $MOVIE, but $MOVIETITLE contains only 'Angels' whereas I'd like it to contain 'Angels & Demons'
#318079 - 13/01/2009 19:02
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, $MOVIE doesn't contain the necessary information, so the problem is before the awk that sets $MOVIETITLE.
$MOVIE immediately after the beginning of the for loop contains only the ID and the movie name up to the first word, Angels.
The next time it passes through the loop it continues from where it left off, producing invalid results where the $MOVIE var contains only an ampersand.
$TRAILERS does contain everything as I expected it to be: the ID, the full movie title (which already appears to have the ampersand converted from an html entity), the date and the rest of the info, for every movie in the XML file.
I'm at least as stuck as I was originally though, having no idea why $MOVIE doesn't contain what I'd expect it to (the full contents of a single "row").
#318080 - 13/01/2009 19:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Hmmm.. I'm the one that added the movie title to the collection of fields with this line:
-v "info/title" -o ";field"\
Then I obviously had to adjust the position of the fields extracted. Before I made that change, none of the fields captured from the XML contained spaces. Maybe I should encode the space so that it isn't actually a space? It seems like the space is what's causing the termination of the MOVIE variable in the FOR.
Edited by hybrid8 (13/01/2009 19:11)
#318081 - 13/01/2009 19:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4181
Loc: Cambridge, England
|
> Maybe I should encode the space so that it isn't actually a space? It seems like the space is what's causing the termination of the MOVIE variable in the FOR.
Yes, by default "for" splits words at any whitespace (space, tab, newline). To stop it doing that -- to make it split words at newlines only -- set the shell variable IFS to "\n".
$ cat > hybrid8.txt
a b
c
$ for i in `cat hybrid8.txt` ; do echo $i ; done
a
b
c
$ IFS="\n"
$ for i in `cat hybrid8.txt` ; do echo $i ; done
a b
c
$
Peter
#318082 - 13/01/2009 19:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
Oh. It probably has nothing to do with the ampersand and everything to do with the space.
"for ... in" splits on whitespace, not newline. Before I spend a lot of time going down this road, delete the "Angels & Demons" line and see if it breaks similarly on "Astro Boy".
_________________________
Bitt Faulk
#318083 - 13/01/2009 19:14
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: peter]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
_________________________
Bitt Faulk
#318085 - 13/01/2009 19:39
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Can anyone offer a suggestion as to why setting IFS='\n' causes it to use the "n" character as the field separator? I saw another suggestion elsewhere to use a WHILE loop instead of a FOR to avoid changing IFS default.
EDIT: Setting it like this Seems to work.
Now I'm just doing some digging to best be able to clean the names before creating files or folders from them (no slashes, colons, gt, lt, etc..) Does anyone already have a suitable script or pointer to something somewhat universal for this?
Edited by hybrid8 (13/01/2009 20:45)
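On the question of why "n" ends up as a separator: inside single or double quotes bash does not interpret \n, so IFS receives two characters, a backslash and the letter n. Bash's ANSI-C quoting, $'\n', produces a real newline. The sketch below shows both that and the while-loop alternative; the rows are made-up stand-ins for the contents of $TRAILERS, not real feed output.

```shell
#!/bin/bash
# Made-up sample rows standing in for $TRAILERS.
ROWS='1;fieldAngels & Demons;field2009
2;fieldAstro Boy;field2009'

# $'\n' is a real newline, so the for loop now splits on lines only.
OLDIFS=$IFS
IFS=$'\n'
COUNT=0
for ROW in $ROWS; do
    COUNT=$((COUNT + 1))
    echo "row $COUNT: $ROW"
done
IFS=$OLDIFS

# Alternative that leaves IFS untouched for the rest of the script:
# read one line at a time. Quoting "$ROW" keeps the spaces intact.
while IFS= read -r ROW; do
    TITLE=$(printf '%s\n' "$ROW" | awk 'BEGIN { FS = ";field" } { print $2 }')
    echo "title: $TITLE"
done <<EOF
$ROWS
EOF
```

The while form has the side benefit that a stray change to IFS cannot leak into later commands, at the cost of a slightly less familiar loop shape.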
#318090 - 13/01/2009 21:06
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Awk can do that rather trivially:
echo "anythingyouwant234234" | awk '{gsub("[^a-z]","");print}'
Just add your list of permitted characters/ranges into the "a-z" regular expression. It's much easier to list permitted stuff than to try and enumerate all of the forbidden characters.
Cheers
#318091 - 13/01/2009 21:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Awk can do that rather trivially:
> echo "anythingyouwant234234" | awk '{gsub("[^a-z]","");print}'
So, for example, from a shell script one could use this sequence:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | awk '{gsub("[^- a-zA-Z0-9_$+]*","");print}'`
EDIT: fixed some issues above now
That's probably a good start at it.
Edited by mlord (13/01/2009 21:12)
#318092 - 13/01/2009 21:11
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Thanks Mark.
How about having SOME forbidden characters so that I can replace them with specific alternatives? I suppose I could do that before passing the result to the awk example you posted (making sure to include whatever my alternatives are in the list of allowed characters).
In your example, anything not defined in the substitution is replaced with "" correct?
So if I create another command and I omit the ^ which is a NOT if I recall, I can include a list of characters to be replaced.
#318093 - 13/01/2009 21:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Sure thing.. just start with the corrected version I just fixed.
#318094 - 13/01/2009 21:15
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
You can add more gsub calls inside the single awk invocation. For example, this version replaces all spaces with underscores, and then does the regular vetting of the result:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | awk '{gsub(" ","_"); gsub("[^- a-zA-Z0-9_$+]*","");print}'`
EDIT: also note that, if a dash - character is wanted inside the [] expression, it has to be FIRST, or immediately after the negation ^ character if present.
Edited by mlord (13/01/2009 21:19)
#318095 - 13/01/2009 21:18
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, great. I should be able to get what I need with this. Just have to figure out how to extend the ranges in the last sub to include more valid characters like brackets, etc...
If I want to include a quote can I escape it with a backslash? Like so ' \" '
And will I be able to easily include a single quote (actually a normal ascii apostrophe) with "'" ?
How about allowing square brackets? Need to be escaped I suppose?
Lastly, is the space after the dash you mentioned there to delimit that dash from the rest of the characters? So if I want to include a space character as allowed, can I add it anywhere within the [] ?
Edited by hybrid8 (13/01/2009 21:23)
#318096 - 13/01/2009 21:22
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Ok, great. I should be able to get what I need with this. Just have to figure out how to extend the ranges in the last sub to include more valid characters like brackets, etc...
> If I want to include a quote can I escape it with a backslash? Like so ' \" '
Yeah, except it can get very messy and confusing because the shell itself may also try to interpret some things like that before passing the strings to awk. So some double escapes might be needed.
It's because of that fuss that I normally would just do the whole script as an awk script rather than a bash script. No double escaping needed, and a heck of a lot less confusing.
Awk is kinda nice for this stuff, with its C-like control structures and free typing of things. But difficult to remember if one only uses it once a year or less. I use it weekly here -- my fav programming language!
Cheers
Edited by mlord (13/01/2009 21:34)
#318097 - 13/01/2009 21:30
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
For example, one could separate out the filename sanity stuff into its own script in a separate file, like this:
EDIT: added some html escapes for fun.
EDIT: fixed the ordering of a few things.
#!/usr/bin/gawk -f
{
    ## spaces to underscores:
    gsub(" ","_")
    ## some html escapes:
    gsub("&gt;",">")
    gsub("&lt;","<")
    gsub("&quot;","\"")
    gsub("&amp;","\\&")
    ## square brackets into round brackets:
    gsub("\\[","(")
    gsub("\\]",")")
    ## double-quotes into apostrophes:
    gsub("\"","'")
    ## sanitize the rest:
    gsub("[^- 'a-zA-Z0-9_$+&<>]*","")
    ## dump it to stdout
    print
}
Which could be saved as sanitize.awk, and then be used like this:
NASTYNAME="whatever.. "
GOODNAME=`echo "$NASTYNAME" | sanitize.awk`
Edited by mlord (13/01/2009 21:46)
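Calling the cleaner by its bare name relies on the file being executable and sitting in a directory on $PATH. A sketch of an alternative that needs neither is to hand the file to awk with -f. The cut-down cleaner and the temporary directory below are assumptions made only so the example is self-contained; the sample name is made up.

```shell
#!/bin/bash
# Write a cut-down cleaner to a temporary directory (illustrative only).
SKETCHDIR=$(mktemp -d)
cat > "$SKETCHDIR/sanitize.awk" <<'EOF'
{
    ## spaces to underscores:
    gsub(" ", "_")
    ## drop everything that is not on the short allow-list:
    gsub("[^-a-zA-Z0-9_]", "")
    print
}
EOF

NASTYNAME="What: ever / else?"
# "awk -f FILE" works without the execute bit and regardless of $PATH.
GOODNAME=$(printf '%s\n' "$NASTYNAME" | awk -f "$SKETCHDIR/sanitize.awk")
echo "$GOODNAME"
rm -rf "$SKETCHDIR"
```

Quoting "$NASTYNAME" matters here for the same reason as in the main loop: unquoted, the shell would split and re-join the title before awk ever sees it.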
#318098 - 13/01/2009 21:48
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
What's up with the double escaping on the square brackets?
I think this is the solution I'll use - the cleaner as its own file.
Some additional q's...
Do I need to escape out smart quotes (singles and doubles) or should I instead specify them some other way?
How about forward slash? I'm assuming backward slash just needs a single escape like so \\. I'd like to convert the slashes to dashes, so I need to account for them, otherwise I wouldn't bother and just leave it to the last sub to drop them.
Damn, there's just a boatload of other characters I want to allow as well. Can the following be specified plainly (without escaping)... period, comma, colon, semicolon, question mark and the curly brackets { } - and how about the shifted characters above the numerals with the exception of asterisk and caret?
#318099 - 13/01/2009 21:57
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Many of those have special meaning inside a regular expression, so I suggest you try them one at a time when in doubt. The backslash will need to become a foursome (\\\\), and other known special characters include the dot, asterisk, dollar-sign, etc..
Here's part of the manpage:
Regular Expressions
    Regular expressions are the extended kind found in egrep.
    They are composed of characters as follows:
    c          matches the non-metacharacter c.
    \c         matches the literal character c.
    .          matches any character including newline.
    ^          matches the beginning of a string.
    $          matches the end of a string.
    [abc...]   character list, matches any of the characters abc....
    [^abc...]  negated character list, matches any character except abc....
    r1|r2      alternation: matches either r1 or r2.
    r1r2       concatenation: matches r1, and then r2.
    r+         matches one or more r’s.
    r*         matches zero or more r’s.
    r?         matches zero or one r’s.
    (r)        grouping: matches r.
    r{n}
    r{n,}
    r{n,m}     One or two numbers inside braces denote an interval expression. If there is one number
               in the braces, the preceding regular expression r is repeated n times. If there are
               two numbers separated by a comma, r is repeated n to m times. If there is one number
               followed by a comma, then r is repeated at least n times.
               Interval expressions are only available if either --posix or --re-interval is specified
               on the command line.
    \y         matches the empty string at either the beginning or the end of a word.
    \B         matches the empty string within a word.
    \<         matches the empty string at the beginning of a word.
    \>         matches the empty string at the end of a word.
    \w         matches any word-constituent character (letter, digit, or underscore).
    \W         matches any character that is not word-constituent.
    \`         matches the empty string at the beginning of a buffer (string).
    \'         matches the empty string at the end of a buffer.
    The escape sequences that are valid in string constants (see below) are also valid in regular
    expressions.
Edited by mlord (13/01/2009 21:59)
#318100 - 13/01/2009 22:06
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
Here are some more examples:
## backquotes become apostrophes:
gsub("`","'")
## backslashes become dashes:
gsub("\\\\","-")
## slashes, question marks, carets, dollarsigns become underscores:
gsub("[/?^$]","_")
That last one above shows one way to deal with characters you aren't sure about: enclose them inside square brackets and they are no longer special (except for backslashes, dashes, or a leading caret).
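The backslash foursome comes from two layers of unescaping: awk first processes the string constant, turning "\\\\" into two backslashes, and the regex engine then reads those two as one literal backslash. A regex literal written between slashes skips the first layer. A small sketch, run from the shell inside single quotes so that the shell itself leaves the backslashes alone; the sample input is made up:

```shell
#!/bin/bash
# Made-up input containing one backslash and one forward slash.
IN='a\b/c'

# String-constant form: four backslashes in the source, one matched.
OUT1=$(printf '%s\n' "$IN" | awk '{ gsub("\\\\", "-"); print }')

# Regex-literal form: two backslashes in the source, one matched.
OUT2=$(printf '%s\n' "$IN" | awk '{ gsub(/\\/, "-"); print }')

echo "$OUT1"
echo "$OUT2"
```

Both forms should produce the same result; the regex-literal form is simply easier to count.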
#318103 - 13/01/2009 22:46
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: mlord]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, here's what I have so far (I have yet to test this):
#!/usr/bin/gawk -f
{
    ## some html escapes:
    gsub("&gt;",">")
    gsub("&lt;","<")
    gsub("&quot;","\"")
    gsub("&amp;","\\&")
    ## replace fancy "smart" quotes with straight equivalents
    gsub("’","'")
    gsub("‘","'")
    gsub("“","\"")
    gsub("”","\"")
    ## backquote to apostrophe
    gsub("`","'")
    ## double quote to apostrophe
    gsub("\"","'")
    ## select illegal filename characters replaced by alternates (other illegal characters just dropped later)
    gsub(">",")")
    gsub("<","(")
    gsub("[:]"," - ")
    gsub("[/]","-")
    ## backslash to dash
    gsub("\\\\","-")
    ## double space to single space:
    gsub("  "," ")
    ## sanitize the rest:
    gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")
    ## dump it to stdout
    print
}
What's an easy way to strip leading and trailing whitespace? That's about all that's left to do (just in case, but strictly for beautifying).
#318104 - 13/01/2009 23:02
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Ok, it seems to work except that the smart quotes stuff I have isn't actually matching what's in the source. In the cygwin shell the offending characters come up as "â?T" which is sort of meaningless. Seems like a case of UTF characters... If I pipe the output to a file and then open it in a UTF-capable text editor on my Mac then they come up as the normal smart characters. If I save the awk file as UTF8 then it breaks when piped from the batch file. Is there a proper way to be able to use UTF8 in bash and awk? This of course also reminds me that I have to include accented characters as valid. I should have known this was going to get hairier...
|
Top
|
|
|
|
#318105 - 13/01/2009 23:03
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
gsub("^[[:blank:]]*", "")
gsub("[[:blank:]]*$", "")
I hate POSIX regexes.
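As a sketch of how those two calls might be wrapped for use from the bash script: trim is a hypothetical helper name. It uses [ \t] instead of [[:blank:]] on the assumption that some older awk builds lack POSIX character classes, and sub instead of gsub because an anchored pattern can match at most once per line.

```shell
#!/bin/bash
# trim is a hypothetical helper: strips leading and trailing spaces/tabs
# from each line read on stdin.
trim() {
    awk '{ sub(/^[ \t]+/, ""); sub(/[ \t]+$/, ""); print }'
}

TRIMMED=$(printf '%s\n' '   Astro Boy  ' | trim)
echo "[$TRIMMED]"
```

Interior spaces are left alone, so a title like "Astro Boy" keeps its single space.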
_________________________
Bitt Faulk
#318106 - 13/01/2009 23:09
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 25/12/2000
Posts: 16706
Loc: Raleigh, NC US
|
> Is there a proper way to be able to use UTF8 in bash and awk?
HAHAHAHAHAHA
Uh, maybe. If your version of gawk is recent enough and has the right support built in, and you can set your LC_ALL and/or LANG environment variables to "en_US.UTF-8" (or something similar to that; your politics might require you to use "en_CA.UTF-8"), you might get it to work.
Or you could just use perl (or Tcl or Ruby or Python or Forth or Haskell or whatever your pet language might be) instead.
Edited by wfaulk (13/01/2009 23:10) Edit Reason: americentrism
_________________________
Bitt Faulk
#318107 - 13/01/2009 23:10
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Is there a proper way to be able to use UTF8 in bash and awk?
Dunno. But if they're single-byte characters, then find their hexcodes and use:
String Constants
    String constants in AWK are sequences of characters enclosed between double quotes ("). Within
    strings, certain escape sequences are recognized, as in C. These are:
    \\         A literal backslash.
    \a         The “alert” character; usually the ASCII BEL character.
    \b         backspace.
    \f         form-feed.
    \n         newline.
    \r         carriage return.
    \t         horizontal tab.
    \v         vertical tab.
    \xhex digits
               The character represented by the string of hexadecimal digits following the \x. As in ANSI
               C, all following hexadecimal digits are considered part of the escape sequence. (This feature
               should tell us something about language design by committee.) E.g., "\x1B" is the ASCII
               ESC (escape) character.
    \ddd       The character represented by the 1-, 2-, or 3-digit sequence of octal digits. E.g., "\033"
               is the ASCII ESC (escape) character.
    \c         The literal character c.
    The escape sequences may also be used inside constant regular expressions (e.g., /[ \t\f\n\r\v]/
    matches whitespace characters).
    In compatibility mode, the characters represented by octal and hexadecimal escape sequences are
    treated literally when used in regular expression constants. Thus, /a\52b/ is equivalent to
    /a\*b/.
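The smart quotes in question are probably not single-byte, though: the "â?T" seen in the cygwin shell is consistent with U+2019 (right single quote), which is the three bytes E2 80 99 in UTF-8. A sketch of matching that byte sequence with octal escapes, which keeps the awk source plain ASCII and avoids the greedy \x behaviour described above. Forcing LC_ALL=C so that awk works on raw bytes is an assumption made here for predictability, and the sample title is made up.

```shell
#!/bin/bash
# Build a sample string containing U+2019 as raw UTF-8 bytes (octal 342 200 231).
SMART=$(printf 'It\342\200\231s Complicated')

# Replace the three-byte sequence with a plain apostrophe (octal 047).
# LC_ALL=C makes awk treat the input as bytes rather than multibyte characters.
PLAIN=$(printf '%s\n' "$SMART" | LC_ALL=C awk '{ gsub("\342\200\231", "\047"); print }')
echo "$PLAIN"
```

The other smart quotes would need their own byte sequences looked up in the same way; none are filled in here.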
#318108 - 13/01/2009 23:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: wfaulk]
|
carpal tunnel
Registered: 12/11/2001
Posts: 7738
Loc: Toronto, CANADA
|
Bitt, is your example supposed to say "blank" ?
|
Top
|
|
|
|
#318109 - 13/01/2009 23:12
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> gsub("[^- 'a-zA-Z0-9 _$+&={}\\[\\]()%@!;,.]*","")
Not sure about that one -- putting square brackets inside square brackets is ambiguous.
Cheers
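One way to sidestep the ambiguity, as a sketch: POSIX bracket expressions treat ] as a literal when it comes first in the list (right after the ^ in a negated list) and - as a literal when it comes first or last, so no backslashes are needed. The allow-list below is deliberately shorter than the one quoted above, and the sample title is made up.

```shell
#!/bin/bash
# Made-up title with brackets, angle brackets, a colon and a dash.
NASTY='Movie [2009] <HD>: part-1'

# Allow-list: ] first, then [, space, letters, digits, underscore, and - last.
CLEAN=$(printf '%s\n' "$NASTY" | awk '{ gsub(/[^][ a-zA-Z0-9_-]/, ""); print }')
echo "$CLEAN"
```

Behaviour of older awk builds with this form may vary, so it is worth testing on the actual cygwin awk before relying on it.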
#318110 - 13/01/2009 23:13
Re: bash scripting (xml parsing) help... (mainly awk and sed)
[Re: hybrid8]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14503
Loc: Canada
|
> Bitt, is your example supposed to say "blank" ?
Yes. But you could do it with real blanks, tabs, and newlines if you really wanted to.
[:space:] Space characters (such as space, tab, and formfeed, to name a few).
Edited by mlord (13/01/2009 23:14)