root/branches/feature-server/plagger/lib/Plagger/Plugin/Filter/EntryFullText.pm

Revision 1009 (checked in by miyagawa, 4 years ago)

r2680@rock (orig r938): miyagawa | 2006-06-09 10:28:17 +0900
add podtrac TruePermalink. via http://d.hatena.ne.jp/mryfmo/20060608
r2681@rock (orig r939): miyagawa | 2006-06-09 10:30:48 +0900
TruePermalink: add feedburner podcast redirector. Refs #226
r2682@rock (orig r940): miyagawa | 2006-06-09 16:11:35 +0900
use Last-Modified header to populate entry date, even if handler can't find one.
via http://subtech.g.hatena.ne.jp/otsune/20060608/norkdailymemo
r2683@rock (orig r941): miyagawa | 2006-06-09 16:12:52 +0900
take off utf-8 flag when taking digest value
r2684@rock (orig r942): miyagawa | 2006-06-09 17:04:38 +0900

Publish::CHTML: Don't die if body contains non-sjis mappable characters
r2685@rock (orig r943): miyagawa | 2006-06-09 17:26:01 +0900
defaults to cp932 would be better
r2686@rock (orig r944): miyagawa | 2006-06-09 17:37:37 +0900

r2687@rock (orig r945): miyagawa | 2006-06-09 18:48:15 +0900
add pya.cc upgrader via http://subtech.g.hatena.ne.jp/otsune/20060608/pya2feed
r2688@rock (orig r946): miyagawa | 2006-06-09 21:21:47 +0900
CustomFeed::2chSearch
r2689@rock (orig r947): miyagawa | 2006-06-09 21:26:31 +0900
oops, remove </b>
r2690@rock (orig r948): miyagawa | 2006-06-09 21:44:42 +0900
fix date if it found true entry
r2691@rock (orig r949): miyagawa | 2006-06-09 21:59:05 +0900
need quotes
r2692@rock (orig r950): miyagawa | 2006-06-09 22:06:35 +0900
Planet: Scrubber support back in lib/Plagger/Plugin/Publish/Planet.pm
r2693@rock (orig r951): miyagawa | 2006-06-09 22:08:01 +0900
oops
r2694@rock (orig r952): otsune | 2006-06-09 22:11:04 +0900
fix extract http://pyc.cc/

r2695@rock (orig r953): otsune | 2006-06-09 22:12:28 +0900
add EntryFullText for seesaa blog

r2696@rock (orig r954): otsune | 2006-06-09 23:27:11 +0900
fix %3A

r2697@rock (orig r955): miyagawa | 2006-06-10 02:26:28 +0900
MixiDiarySearch: decode keyword query
r2698@rock (orig r956): miyagawa | 2006-06-10 02:53:41 +0900
TruePermalink enbug stuff. Use permalink to find handlers
r2699@rock (orig r957): otsune | 2006-06-10 03:08:33 +0900
add EntryFullText for http://headlines.yahoo.co.jp/

r2700@rock (orig r958): otsune | 2006-06-10 04:38:27 +0900
add Apple KB and TIL document

r2701@rock (orig r959): otsune | 2006-06-10 04:43:22 +0900
oops.

r2702@rock (orig r960): miyagawa | 2006-06-10 23:07:48 +0900
set Bloglines n=100
r2703@rock (orig r961): miyagawa | 2006-06-11 01:35:38 +0900
MixiDiarySearch: allow no_photo.gif
r2704@rock (orig r962): miyagawa | 2006-06-11 01:45:53 +0900
2chSearch: Fix error handling
r2705@rock (orig r963): miyagawa | 2006-06-11 02:07:11 +0900
added takesako-san for his patch
r2706@rock (orig r964): otsune | 2006-06-11 05:59:58 +0900
modified Chugoku Shinbun, add EFT for http://www.zianplus.net/

r2707@rock (orig r965): otsune | 2006-06-11 10:17:02 +0900
add pMachine ExpressionEngine http://www.pmachine.com/

r2708@rock (orig r966): youpy | 2006-06-11 12:38:21 +0900
fix regexp

r2709@rock (orig r967): otsune | 2006-06-12 04:09:24 +0900
fix extract regexp

r2710@rock (orig r968): otsune | 2006-06-12 04:13:19 +0900
update regexp

r2711@rock (orig r969): otsune | 2006-06-12 04:29:18 +0900
support http://www.mainichi-msn.co.jp/photo/etc/photo_feature/

r2712@rock (orig r970): otsune | 2006-06-12 06:08:15 +0900
fix wordpress.
Add mainichi-msn Photo and separate handle.
Add http://www.actiblog.com/

r2713@rock (orig r971): otsune | 2006-06-12 07:02:23 +0900
refine livedoorblog.pl
fix miss.

r2714@rock (orig r972): miyagawa | 2006-06-12 13:25:28 +0900
extract_title should be case insensitive. via http://d.hatena.ne.jp/sfujiwara/20060611/1150051152
r2715@rock (orig r973): miyagawa | 2006-06-12 13:39:12 +0900
rewrite config doesn't die even if it can't rewrite because of permission problem
r2716@rock (orig r974): miyagawa | 2006-06-12 13:43:25 +0900
skip all livedoorkeyword link
r2719@rock (orig r975): otsune | 2006-06-12 14:50:19 +0900
fix misc regexp

r2720@rock (orig r976): miyagawa | 2006-06-12 15:44:57 +0900
support handle only in livedoorblog.pl to work with aggregated feeds
r2721@rock (orig r977): miyagawa | 2006-06-12 18:22:40 +0900
TruePermalink for blogpeople redirector
r2722@rock (orig r978): otsune | 2006-06-12 22:14:03 +0900
oops: 'Unmatched ( in regex;'

r2723@rock (orig r979): youpy | 2006-06-13 10:21:42 +0900
add mailman upgrader


r2724@rock (orig r980): youpy | 2006-06-13 10:28:19 +0900
fix handle regexp


r2727@rock (orig r983): miyagawa | 2006-06-13 19:00:22 +0900
Subscription::Planet: add feedster.jp
r2728@rock (orig r984): miyagawa | 2006-06-13 19:06:06 +0900
use lang/all on feedster.jp
r2734@rock (orig r985): otsune | 2006-06-13 22:11:21 +0900
fix regexp

r2735@rock (orig r986): miyagawa | 2006-06-14 00:34:01 +0900
new plugin Notify::Beep
r2736@rock (orig r987): miyagawa | 2006-06-14 00:34:40 +0900
planet: remove unnecessary bit
r2737@rock (orig r988): miyagawa | 2006-06-14 00:35:03 +0900
update example to use sixapart-std
r2738@rock (orig r989): otsune | 2006-06-14 02:55:47 +0900
remove icon_re. RecentComment can't get it

r2745@rock (orig r990): miyagawa | 2006-06-14 12:07:29 +0900
t/core is for developer test and not needed for installers
r2746@rock (orig r991): miyagawa | 2006-06-14 12:49:00 +0900
support mixi_tos_paranoia mode
r2747@rock (orig r992): miyagawa | 2006-06-14 13:10:40 +0900
title would be ok
r2792@rock (orig r993): miyagawa | 2006-06-16 15:04:12 +0900
  • New plugin Subscription::Bookmarks (and its IE subclass) to read IE favorites.
r2793@rock (orig r994): miyagawa | 2006-06-16 15:11:52 +0900
added TODO as comment
r2794@rock (orig r995): youpy | 2006-06-17 20:36:18 +0900
add Plugin::Subscription::Bookmarks::Safari


r2795@rock (orig r996): youpy | 2006-06-17 21:39:18 +0900
add tag support by folder name


r2796@rock (orig r997): youpy | 2006-06-18 15:41:59 +0900
use $uri->file when scheme is 'file'


r2797@rock (orig r998): youpy | 2006-06-18 15:42:56 +0900
add Plugin::Subscription::Bookmarks::Mozilla


r2798@rock (orig r999): miyagawa | 2006-06-19 15:23:13 +0900
bump URI::Fetch req
r2800@rock (orig r1000): miyagawa | 2006-06-22 00:26:46 +0900
dependency for Bookmarks::Safari. 1000th commit!
r2801@rock (orig r1001): miyagawa | 2006-06-22 00:30:57 +0900
fix config rewriting bug when the password contains regexp metachars. via http://d.hatena.ne.jp/sfujiwara/20060621/1150899012
r2802@rock (orig r1002): otsune | 2006-06-22 00:54:24 +0900
add http://www.computerworld.jp/ http://autopage.teacup.com/
fix headlines_yahoo_jp (Thanks woremacx)
fix goo blog

r2803@rock (orig r1003): miyagawa | 2006-06-22 01:10:00 +0900
import drawnboy's EntryFullText yamls via http://svn.nowherenear.net/repos/public/misc/eft/
r2804@rock (orig r1004): miyagawa | 2006-06-22 01:10:39 +0900
update AUTHOR
r2805@rock (orig r1005): s_nobu | 2006-06-22 06:17:15 +0900
require HTML::Entities for enclosure support.

r2807@rock (orig r1006): miyagawa | 2006-06-22 15:46:30 +0900
URI::Fetch 0.07 is broken (i was a moron), reverting back to 0.06 for now
r2808@rock (orig r1007): miyagawa | 2006-06-22 16:04:48 +0900
packaging 0.7.3
package Plagger::Plugin::Filter::EntryFullText;
use strict;
use base qw( Plagger::Plugin );

use DirHandle;
use Encode;
use File::Spec;
use List::Util qw(first);
use HTML::ResolveLink;
use Plagger::Date; # for metadata in plugins
use Plagger::Util qw( decode_content );
use Plagger::Plugin::CustomFeed::Simple;
use Plagger::UserAgent;
use YAML; # for YAML::LoadFile in load_plugin_yaml

sub rule_hook { 'update.entry.fixup' }

sub register {
    my($self, $context) = @_;
    $context->register_hook(
        $self,
        'customfeed.handle'  => \&handle,
        'update.entry.fixup' => \&filter,
    );
}

sub init {
    my $self = shift;
    $self->SUPER::init(@_);
    $self->load_plugins();

    $self->{ua} = Plagger::UserAgent->new;
}

sub load_plugins {
    my $self = shift;
    my $context = Plagger->context;

    # load every regular .pl / .yaml file in the assets directory, in sorted order
    my $dir = $self->assets_dir;
    my $dh = DirHandle->new($dir) or $context->error("$dir: $!");
    for my $file (grep -f $_->[0] && $_->[0] =~ /\.(?:pl|yaml)$/,
                  map [ File::Spec->catfile($dir, $_), $_ ], sort $dh->read) {
        $self->load_plugin(@$file);
    }
}

sub load_plugin {
    my($self, $file, $base) = @_;

    Plagger->context->log(debug => "loading $file");

    # .pl files are evaluated as Perl code, .yaml files are parsed as YAML
    my $load_method = $file =~ /\.pl$/ ? 'load_plugin_perl' : 'load_plugin_yaml';
    push @{ $self->{plugins} }, $self->$load_method($file, $base);
}

sub load_plugin_perl {
    my($self, $file, $base) = @_;

    open my $fh, '<', $file or Plagger->context->error("$file: $!");
    (my $pkg = $base) =~ s/\.pl$//;
    my $plugin_class = "Plagger::Plugin::Filter::EntryFullText::Site::$pkg";

    my $code = join '', <$fh>;
    # wrap bare code in a package derived from the Site base class
    unless ($code =~ /^\s*package/s) {
        $code = join "\n",
            ( "package $plugin_class;",
              "use strict;",
              "use base qw( Plagger::Plugin::Filter::EntryFullText::Site );",
              "sub site_name { '$pkg' }",
              $code,
              "1;" );
    }

    eval $code;
    Plagger->context->error($@) if $@;

    return $plugin_class->new;
}

sub load_plugin_yaml {
    my($self, $file, $base) = @_;
    my @data = YAML::LoadFile($file);

    return map { Plagger::Plugin::Filter::EntryFullText::YAML->new($_, $base) }
        @data;
}

sub handle {
    my($self, $context, $args) = @_;

    my $handler = first { $_->custom_feed_handle($args) } @{ $self->{plugins} };
    if ($handler) {
        $args->{match} = $handler->custom_feed_follow_link;
        return $self->Plagger::Plugin::CustomFeed::Simple::aggregate($context, $args);
    }
}

sub filter {
    my($self, $context, $args) = @_;

    my $handler = first { $_->handle_force($args) } @{ $self->{plugins} };
    if ( !$handler && $args->{entry}->body && $args->{entry}->body =~ /<\w+>/ && !$self->conf->{force_upgrade} ) {
        $self->log(debug => $args->{entry}->link . " already contains body. Skipped");
        return;
    }

    if (! $args->{entry}->permalink) {
        $self->log(debug => "Entry " . $args->{entry}->title . " doesn't have permalink. Skipped");
        return;
    }

    # NoNetwork: don't connect for 3 hours
    my $res = $self->{ua}->fetch( $args->{entry}->permalink, $self, { NoNetwork => 60 * 60 * 3 } );
    return if !$res->status && $res->is_error;

    $args->{content} = decode_content($res);

    # if the request was redirected, set it as permalink
    if ($res->http_response) {
        my $base = $res->http_response->request->uri;
        if ( $base ne $args->{entry}->permalink ) {
            $context->log(info => "rewrite permalink to $base");
            $args->{entry}->permalink($base);
        }
    }

    # use Last-Modified to populate entry date, even if handler doesn't find one
    if ($res->last_modified && !$args->{entry}->date) {
        $args->{entry}->date( Plagger::Date->from_epoch($res->last_modified) );
    }

    my @plugins = $handler ? ($handler) : @{ $self->{plugins} };

    for my $plugin (@plugins) {
        if ( $handler || $plugin->handle($args) ) {
            $context->log(debug => $args->{entry}->permalink . " handled by " . $plugin->site_name);
            my $data = $plugin->extract($args);
               $data = { body => $data } if $data && !ref $data;
            if ($data) {
                $context->log(info => "Extract content succeeded on " . $args->{entry}->permalink);
                my $resolver = HTML::ResolveLink->new( base => $args->{entry}->permalink );
                $data->{body} = $resolver->resolve( $data->{body} );
                $args->{entry}->body($data->{body});
                $args->{entry}->title($data->{title}) if $data->{title};
                $args->{entry}->icon({ url => $data->{icon} }) if $data->{icon};

                # extract date using found one
                if ($data->{date}) {
                    $args->{entry}->date($data->{date});
                }

                return 1;
            }
        }
    }

    # failed to extract: store whole HTML if the config is on
    if ($self->conf->{store_html_on_failure}) {
        $args->{entry}->body($args->{content});
        return 1;
    }

    $context->log(warn => "Extract content failed on " . $args->{entry}->permalink);
}


package Plagger::Plugin::Filter::EntryFullText::Site;
sub new { bless {}, shift }
sub custom_feed_handle { 0 }
sub custom_feed_follow_link { }
sub handle_force { 0 }
sub handle { 0 }

package Plagger::Plugin::Filter::EntryFullText::YAML;
use Encode;
use List::Util qw(first);

sub new {
    my($class, $data, $base) = @_;

    # add ^ if handle method starts with http://
    for my $key ( qw(custom_feed_handle handle handle_force) ) {
        next unless defined $data->{$key};
        $data->{$key} = "^$data->{$key}" if $data->{$key} =~ m!^https?://!;
    }

    # decode as UTF-8
    for my $key ( qw(extract extract_date_format) ) {
        next unless defined $data->{$key};
        if (ref $data->{$key} && ref $data->{$key} eq 'ARRAY') {
            $data->{$key} = [ map decode("UTF-8", $_), @{$data->{$key}} ];
        } else {
            $data->{$key} = decode("UTF-8", $data->{$key});
        }
    }

    bless {%$data, base => $base }, $class;
}

sub site_name {
    my $self = shift;
    $self->{base};
}

sub custom_feed_handle {
    my($self, $args) = @_;
    $self->{custom_feed_handle} ?
        $args->{feed}->url =~ /$self->{custom_feed_handle}/ : 0;
}

sub custom_feed_follow_link {
    $_[0]->{custom_feed_follow_link};
}

sub handle_force {
    my($self, $args) = @_;
    $self->{handle_force}
        ? $args->{entry}->permalink =~ /$self->{handle_force}/ : 0;
}

sub handle {
    my($self, $args) = @_;
    $self->{handle}
        ? $args->{entry}->permalink =~ /$self->{handle}/ : 0;
}

sub extract {
    my($self, $args) = @_;

    if (my @match = $args->{content} =~ /$self->{extract}/s) {
        # map the regexp captures onto the names listed in extract_capture
        my @capture = split /\s+/, $self->{extract_capture};
        my $data;
        @{$data}{@capture} = @match;

        if ($self->{extract_after_hook}) {
            eval $self->{extract_after_hook};
            Plagger->context->error($@) if $@;
        }

        if ($data->{date}) {
            if (my $format = $self->{extract_date_format}) {
                $format = [ $format ] unless ref $format;
                # try each format in order and keep the first element of the result list
                $data->{date} = (map { Plagger::Date->strptime($_, $data->{date}) } @$format)[0];
                if ($data->{date} && $self->{extract_date_timezone}) {
                    $data->{date}->set_time_zone($self->{extract_date_timezone});
                }
            } else {
                $data->{date} = Plagger::Date->parse_dwim($data->{date});
            }
        }

        return $data;
    }
}

1;

__END__

=head1 NAME

Plagger::Plugin::Filter::EntryFullText - Upgrade your feeds to fulltext class

=head1 SYNOPSIS

  - module: Filter::EntryFullText

=head1 DESCRIPTION

This plugin fetches the full text of an entry by doing an HTTP GET on
its permalink and applying a regexp to the HTML. It's just like upgrading
your flight ticket from economy class to business class!

You can write a custom fulltext handler by putting C<.pl> or C<.yaml>
files under the plugin's assets directory.

=head1 CONFIG

=over 4

=item store_html_on_failure

If fulltext handlers fail to extract the content body from the HTML,
this option stores the whole HTML document as the entry body. It is
useful in combination with search engines like Gmail and the Search::
plugins. Defaults to 0.

=item force_upgrade

Even if the entry body already contains HTML, this option forces the
plugin to upgrade the body. Defaults to 0.

=back

=head1 WRITING CUSTOM FULLTEXT HANDLER

(To be documented)

=head1 AUTHOR

Tatsuhiko Miyagawa

=head1 SEE ALSO

L<Plagger>
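
The YAML handler path in the listing above (the `new`, `handle` and `extract` methods of the `EntryFullText::YAML` package) can be sketched as a standalone script. This is only an illustration under stated assumptions: the site `blog.example.com`, its markup, and the helper name `extract_fields` are hypothetical, and the handler is written as a plain Perl hash that stands in for the data a `.yaml` asset file would provide. Only the keys `handle`, `extract` and `extract_capture` come from the module itself.

```perl
use strict;
use warnings;

# Hypothetical handler definition: the same keys a .yaml asset file would
# carry once loaded. The site and the markup are made up for illustration.
my %handler = (
    handle          => 'http://blog.example.com/entry/',
    extract         => '<h1>(.*?)</h1>.*?<div class="body">(.*?)</div>',
    extract_capture => 'title body',
);

# As in EntryFullText::YAML->new: anchor patterns that start with a URL scheme.
$handler{handle} = "^$handler{handle}" if $handler{handle} =~ m!^https?://!;

# As in EntryFullText::YAML->extract: run the regexp over the page and map
# the captures onto the field names listed in extract_capture.
# (extract_fields is a hypothetical helper name, not part of Plagger.)
sub extract_fields {
    my($handler, $content) = @_;
    my @match = $content =~ /$handler->{extract}/s or return;
    my @capture = split /\s+/, $handler->{extract_capture};
    my %data;
    @data{@capture} = @match;
    return \%data;
}

my $permalink = 'http://blog.example.com/entry/42';
my $html      = qq{<html><h1>Hello</h1>\n<div class="body"><p>Full text</p></div></html>};

if ($permalink =~ /$handler{handle}/) {
    my $data = extract_fields(\%handler, $html);
    print "$data->{title}: $data->{body}\n" if $data;
}
```

The hash slice `@data{@capture} = @match` is what ties the order of the capture groups in `extract` to the names in `extract_capture`, so the two settings have to list fields in the same order.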
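
The `.pl` handler path (`load_plugin_perl` in the listing) wraps a file that has no `package` statement in a generated package and evaluates it. A minimal sketch of that wrap-and-eval technique, runnable without Plagger, might look like the following. The names `My::Site` and `load_site`, the sample handler body, and the flat `$args` hash are all assumptions made for illustration; the real module derives from `Plagger::Plugin::Filter::EntryFullText::Site` and passes entry objects rather than plain hashes.

```perl
use strict;
use warnings;

# Stand-in for the EntryFullText::Site base class (hypothetical name).
package My::Site;
sub new    { bless {}, shift }
sub handle { 0 }

package main;

# Body of a hypothetical asset file "example.pl": no package statement,
# only the methods this site needs to override.
my $code = <<'CODE';
sub handle {
    my($self, $args) = @_;
    $args->{permalink} =~ m!^http://www\.example\.com/!;
}
sub extract {
    my($self, $args) = @_;
    ($args->{content} =~ m!<div id="main">(.*?)</div>!s)[0];
}
CODE

# Wrap bare code in a generated package, evaluate it, and return an instance.
sub load_site {
    my($pkg, $code) = @_;
    my $class = "My::Site::$pkg";
    unless ($code =~ /^\s*package/s) {
        $code = join "\n",
            "package $class;",
            "use strict;",
            "our \@ISA = ('My::Site');",
            "sub site_name { '$pkg' }",
            $code,
            "1;";
    }
    eval $code;
    die $@ if $@;
    return $class->new;
}

my $site = load_site(example => $code);
my %args = (
    permalink => 'http://www.example.com/a',
    content   => '<div id="main">text</div>',
);
print $site->site_name, "\n" if $site->handle(\%args);
print $site->extract(\%args), "\n";
```

Because the file is evaluated as code, a handler written this way can override any method of the base class, which is the main reason to choose a `.pl` file over a `.yaml` one.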