Given the following list of prices, I am trying to figure out how to normalize/extract only the digits.
INPUT DESIRED_OUTPUT
CA$1399.00 1399.00
$1399.11 1399.11
$1,399.22< 1399.22
Z$1 399.33 1399.33
$1399.44# 1399.44
C$ 1399.55 1399.55
1,399.66 1399.66
1399.77 1399.77
,1399.88 1399.88
25 1399.88 1399.88
399.99 399.99
88.88 99.99 99.99 (if >2 matches on one line, only the last one matters)
.1399.88 DO NOT MATCH (not a price; too many ".")
666.000 DO NOT MATCH (not a price: too many 0's)
I suppose it is a good idea to begin is with what they all have in common:
- Prices always contain
.NN, but never contain.NNN
Upon further inspection, other rules become apparent:
.NNmust be preceded by one or moredigits.NNN.NNcan be preceded by either,,, or a simpledigit, but nothing else.- Anything following
.NNand preceding*N.NNmarks the end of the match. - Finally, the regex needs to consider commas in things like
1,399.66(1399.66) to determine whether it is a price, but then strip them.1, 399.66, for instance does not equal1399.66: it should be399.66.
I am looking at sed, grep, and awk for a portable and efficient solution. How should I go about approaching this problem?
I found a similar question, but I have no idea how to try the following regex with sed:
^\d+(,\d{1,2})?$
EDIT: Yes, my input format is can be a little weird, because it is the result of the concatenation of scraped pages.